Notes:
There are four cells in this notebook that render interactive Dash outputs (drop-down user selection boxes):
1. Stacked Bar Chart (Dash interactive app): comparison of different categories within groups
2. Count Plot (custom selection of predictor)
3. Feature importance method selection
4. Consensus threshold selection among feature importance techniques (1, 2, 3, or 4)
Please note that to enable these four interactive cells, this notebook must first be hosted on a service provider's platform as a standalone web application.
Alternate static graphs are also provided in this notebook in lieu of each of these four interactive graphs.
Notebook Guide (What has been done?)
Data Preparation:
- Reading and merging of the input master CSV file and the lookup (codes) Excel file
A First Look at the Data:
- Understanding unique values of categorical variables, missing values, etc.
Comprehensive Exploratory Analysis:
- Univariate analysis (count/bar plots, line plots, scatter plots, pair plots, histograms, etc.)
- Bivariate analysis (heatmaps, catplots, etc.)
- Comparative analysis
Comprehensive Descriptive Analysis
- Data distribution, skewness, missing-value analysis, etc.
- Descriptive statistics and different visualisations (box plots, overlaid KDE histograms, etc.)
- Data missingness analysis (contingency-table analysis with the chi-square test)
- Individual variable analysis for Task Completion Time and Repair Cost (Q-Q plots, histograms, box plots, etc.)
- Relationship between the predictor (Task Completion Time) and the response (Repair Cost)
Unsupervised Data Analysis:
Note: this has been done on a limited scale, because many predictors are high-cardinality (high-dimensional) categorical variables, and dimensionality-reduction techniques could reduce clustering precision.
- DBSCAN clustering (repair costs based on Job Status and Initial Priority Description)
Data Imputation:
- Dropping records with missing values for the "Date Comp" variable
- Reasoning behind this choice
Feature Engineering
- Creation of a new predictor ("Task_completion_time") from the existing predictors "Date Logged" and "Date Comp"
Feature Importance Analysis
- Random Forest Regressor, permutation importance, ANOVA F-test, Recursive Feature Elimination
- Selection of predictors based on feature consensus (i.e. voting) among 1, 2, 3, or all 4 feature selection techniques
Predictive Analytics (Supervised Modelling)
Linear Regression
- All prerequisite assumption checks prior to modelling, and post-modelling diagnostics
- Regularisation: (L2) ridge regression for collinearity removal
Note: data linearity, multicollinearity (VIF), residual plot/autocorrelation (ACF) test, Q-Q plot, etc.
- This model was not taken further because key assumptions failed at different stages
Ensemble Modelling
Random Forest
- Hyperparameter tuning, model training, cross-validation, prediction
- Overfitting analysis and prediction accuracy (using loss, residual-vs-predicted, and actual-vs-predicted line curves)
- Model diagnosis using metrics (MSE, RMSE, MAE, and R2)
Gradient Boosting
(Same workflow as Random Forest above)
Notes on Modelling:
- Each model was executed 4 times (based on feature consensus among 1, 2, 3, and all 4 feature selection techniques)
- Models were also executed with and without the new predictor variable ("Task_completion_time")
- One pair of comparative-analysis runs was executed for both models (Random Forest and Gradient Boosting)
- Comprehensive performance comparison between the two types of ensemble models
- Explanation of why the Gradient Boosting (XGBoost) model outperforms Random Forest
An Extra Final Attempt:
Time Series Analysis
- Time-series assumption checks (data stationarity)
- Data decomposition (trend, seasonality, and residual analysis)
- Differencing of different orders to make the data stationary
Time Series Forecasting
- ARIMA modelling
- SARIMA modelling
- Model diagnosis
Note: this modelling was not carried forward for inference because time-series assumptions failed, given the limited 18 months of data.
Thank you.
# Import necessary libraries (deduplicated and grouped by package)
# %matplotlib notebook
import os
import csv
import itertools
import warnings
from itertools import combinations

import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.dates import MonthLocator
from matplotlib.patches import Patch
import seaborn as sns

from scipy import stats
from scipy.stats import chi2_contingency, linregress, spearmanr

import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

from sklearn.cluster import DBSCAN
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, f_classif
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import plot_tree
from xgboost import XGBRegressor

import dash
from dash import dcc, html, Input, Output
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Specify the path to the new working directory
new_directory = "C:/Files/Glasgow expenses/UofG_Sem2/Task_Rside/Input_Files"
# Change the working directory
os.chdir(new_directory)
# Verify the change
current_directory = os.getcwd()
print("Current working directory:", current_directory)
# Reading CSV file using pandas
csv_file_path = "Int.csv"
try:
    # Read the CSV file into a DataFrame
    Int_df = pd.read_csv(
        csv_file_path, encoding='utf-8', keep_default_na=False,
        parse_dates=['Date Logged', 'Date Comp', 'Day of Date Logged'],
        dayfirst=True
    )
    # Display the DataFrame or perform other operations
    print("CSV File Contents:")
    print(Int_df.head())
except FileNotFoundError:
    print(f"Error: The file {csv_file_path} was not found.")
except pd.errors.EmptyDataError:
    print(f"Error: The file {csv_file_path} is empty or contains no data.")
except pd.errors.ParserError as e:
    print(f"Error: Unable to parse the CSV file {csv_file_path}. Reason: {e}")
Current working directory: C:\Files\Glasgow expenses\UofG_Sem2\Task_Rside\Input_Files
CSV File Contents:
Job No Job Type JOB_TYPE_DESCRIPTION CONTRACTOR Year of Build Date \
0 1523686 RREP Responsive Repairs N/A 2021
1 1517771 RREP Responsive Repairs N/A 2021
2 2085766 GASR Gas Responsive Repairs N/A 2021
3 2089539 RREP Responsive Repairs N/A 2021
4 1696509 RREP Responsive Repairs N/A 2021
Jobsourcedescription Property Ref Property Type Initial Priority \
0 CSC Phone Call ID_209 Semi Detached 2
1 CSC Phone Call ID_209 Semi Detached 1
2 CSC Phone Call ID_209 Semi Detached 1
3 CSC Phone Call ID_209 Semi Detached 2
4 CSC Phone Call ID_209 Semi Detached 1
Initial Priority Description ... LATEST_PRIORITY ABANDON_REASON_CODE \
0 Urgent PFI Evolve RD Irvine EMB ... 2
1 Emergency ... 1
2 Emergency ... 1
3 Urgent PFI Evolve RD Irvine EMB ... 2
4 Emergency ... 1
Day of Date Logged SOR_CODE SOR_DESCRIPTION \
0 2022-09-02 390903 LOCK:RENEW MORTICE COMPLETE
1 2022-08-26 39004 DRAIN:JET BLOCKAGE (RTR WITHIN 12HRS)
2 2023-11-26 199998 OUT OF HOURS (NOT FTF)
3 2023-11-28 830009 SHOWER:RECONNECT AND TEST
4 2023-01-24 620515 SHOWER:CLEAR BLOCKAGE INCLUDING REMOVE
Date Logged Mgt Area TRADE_DESCRIPTION Date Comp Total Value
0 2022-09-02 MA1 Carpenter 2022-09-08 100.00
1 2022-08-26 MA1 Drainage Works 2022-08-26 267.12
2 2023-11-26 MA1 Out of Hours Work NaT 88.45
3 2023-11-28 MA1 Electrician NaT 36.63
4 2023-01-24 MA1 Plumbing 2023-01-24 100.00
[5 rows x 21 columns]
Int_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21286 entries, 0 to 21285
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Job No                        21286 non-null  int64
 1   Job Type                      21286 non-null  object
 2   JOB_TYPE_DESCRIPTION          21286 non-null  object
 3   CONTRACTOR                    21286 non-null  object
 4   Year of Build Date            21286 non-null  int64
 5   Jobsourcedescription          21286 non-null  object
 6   Property Ref                  21286 non-null  object
 7   Property Type                 21286 non-null  object
 8   Initial Priority              21286 non-null  object
 9   Initial Priority Description  21286 non-null  object
 10  Job Status                    21286 non-null  int64
 11  LATEST_PRIORITY               21286 non-null  object
 12  ABANDON_REASON_CODE           21286 non-null  object
 13  Day of Date Logged            21286 non-null  datetime64[ns]
 14  SOR_CODE                      21286 non-null  object
 15  SOR_DESCRIPTION               21286 non-null  object
 16  Date Logged                   21286 non-null  datetime64[ns]
 17  Mgt Area                      21286 non-null  object
 18  TRADE_DESCRIPTION             21286 non-null  object
 19  Date Comp                     20496 non-null  datetime64[ns]
 20  Total Value                   21286 non-null  float64
dtypes: datetime64[ns](3), float64(1), int64(3), object(14)
memory usage: 3.4+ MB
# Excel file path
excel_file_path = "C:/Files/Glasgow expenses/UofG_Sem2/Task_Rside/Input_Files/Supporting_data.xlsx"
try:
    # Read the "Abandon_Reason" worksheet into a DataFrame
    Abandon_Reason_df = pd.read_excel(excel_file_path, sheet_name="Abandon_Reason", header=0, index_col=0)
    # Display or perform operations on the Abandon_Reason DataFrame
    print("DataFrame for Abandon_Reason:")
    print(Abandon_Reason_df.head())
except FileNotFoundError:
    print(f"Error: The file {excel_file_path} was not found.")
except pd.errors.EmptyDataError:
    print(f"Error: The file {excel_file_path} is empty or contains no data.")
except pd.errors.ParserError as e:
    print(f"Error: Unable to parse the Excel file {excel_file_path}. Reason: {e}")

try:
    # Read the "Job_Status_Description" worksheet into a DataFrame
    job_description_df = pd.read_excel(excel_file_path, sheet_name="Job_Status_Description", header=0, index_col=0)
    # Display or perform operations on the Job_Status_Description DataFrame
    print("\nDataFrame for Job_Status_Description:")
    print(job_description_df.head())
except FileNotFoundError:
    print(f"Error: The file {excel_file_path} was not found.")
except pd.errors.EmptyDataError:
    print(f"Error: The file {excel_file_path} is empty or contains no data.")
except pd.errors.ParserError as e:
    print(f"Error: Unable to parse the Excel file {excel_file_path}. Reason: {e}")
DataFrame for Abandon_Reason:
ABANDON_REASON_DESC
ABANDON_REASON_CODE
IE Input Error
DUP Duplicate Order
NWK No Work Required
TM Tenant Missed Appt
NaN No Access
DataFrame for Job_Status_Description:
JOB_STATUS_DESCRIPTION
Job Status
1 Note Job
2 Pre-Inspection
6 Job Logged
90 Work Completed
92 Abandoned
print(job_description_df)
print(Abandon_Reason_df)
unique_values = Abandon_Reason_df['ABANDON_REASON_DESC'].unique()
print(unique_values)
# Check unique values for 'ABANDON_REASON_CODE' in the master file 'Int_df'
unique_abandon_reason_codes = Int_df['ABANDON_REASON_CODE'].unique()
# Display the unique codes
print(unique_abandon_reason_codes)
JOB_STATUS_DESCRIPTION
Job Status
1 Note Job
2 Pre-Inspection
6 Job Logged
90 Work Completed
92 Abandoned
93 Invoice Accepted
ABANDON_REASON_DESC
ABANDON_REASON_CODE
IE Input Error
DUP Duplicate Order
NWK No Work Required
TM Tenant Missed Appt
NaN No Access
WCC Wrong Contractor
AJ Alternative Job
IN Inspection Not Required
AP Added to Planned Programme
TR Tenant Refusal
AB Abortive Call
TEST Testing
DCU Data Clean Up
NAP Riverside Not Approved
SEE See Repair Memo
WG Work Under Guarantee
WD Work Deferred
NC No Charge
CA Contractor Link Reason
['Input Error' 'Duplicate Order' 'No Work Required' 'Tenant Missed Appt'
'No Access' 'Wrong Contractor' 'Alternative Job'
'Inspection Not Required' 'Added to Planned Programme' 'Tenant Refusal'
'Abortive Call' 'Testing' 'Data Clean Up' 'Riverside Not Approved'
'See Repair Memo' 'Work Under Guarantee' 'Work Deferred' 'No Charge'
'Contractor Link Reason']
['' 'IE' 'DUP' 'NWK' 'TM' 'NA' 'WCC' 'AJ' 'IN' 'AP' 'TR' 'AB' 'TEST' 'DCU'
'NAP' 'SEE' 'WG' 'WD' 'NC' 'CA']
print(Abandon_Reason_df.index.name)
print(job_description_df.index.name)
ABANDON_REASON_CODE
Job Status
Int_df.isnull().sum()
Job No                            0
Job Type                          0
JOB_TYPE_DESCRIPTION              0
CONTRACTOR                        0
Year of Build Date                0
Jobsourcedescription              0
Property Ref                      0
Property Type                     0
Initial Priority                  0
Initial Priority Description      0
Job Status                        0
LATEST_PRIORITY                   0
ABANDON_REASON_CODE               0
Day of Date Logged                0
SOR_CODE                          0
SOR_DESCRIPTION                   0
Date Logged                       0
Mgt Area                          0
TRADE_DESCRIPTION                 0
Date Comp                       790
Total Value                       0
dtype: int64
# Check unique values for 'ABANDON_REASON_CODE' in the master file 'Int_df'
unique_abandon_reason_codes = Int_df['ABANDON_REASON_CODE'].unique()
# Display the unique codes
print(unique_abandon_reason_codes)
# Assuming 'ABANDON_REASON_CODE' is the column of interest
na_count = (Int_df['ABANDON_REASON_CODE'] == 'NA').sum()
missing_count = Int_df['ABANDON_REASON_CODE'].eq('').sum()
# Display the counts
print("Count of 'NA':", na_count)
print("Count of missing values (empty strings):", missing_count)
['' 'IE' 'DUP' 'NWK' 'TM' 'NA' 'WCC' 'AJ' 'IN' 'AP' 'TR' 'AB' 'TEST' 'DCU'
 'NAP' 'SEE' 'WG' 'WD' 'NC' 'CA']
Count of 'NA': 680
Count of missing values (empty strings): 17253
# Merge with 'ABANDON_REASON_CODE' column from Abandon_Reason_df
Int_df_merged = pd.merge(Int_df, Abandon_Reason_df, left_on='ABANDON_REASON_CODE', right_index=True, how='left')
# Replace 'No Access' description for "NA" codes
Int_df_merged['ABANDON_REASON_DESC'] = np.where(Int_df_merged['ABANDON_REASON_CODE'] == 'NA', 'No Access', Int_df_merged['ABANDON_REASON_DESC'])
# Set NaN for missing codes
Int_df_merged.loc[Int_df_merged['ABANDON_REASON_CODE'].isna(), 'ABANDON_REASON_DESC'] = np.nan
# Merge with 'Job Status' column from job_description_df
Int_df_merged = pd.merge(Int_df_merged, job_description_df, left_on='Job Status', right_on= 'Job Status', how='left')
# Create a mapping dictionary for Initial Priority codes and descriptions
priority_mapping = Int_df_merged.groupby('Initial Priority')['Initial Priority Description'].first().to_dict()
# Update the 'LATEST_PRIORITY' column based on the mapping
Int_df_merged['Latest Priority Description'] = Int_df_merged['LATEST_PRIORITY'].map(priority_mapping)
# Verify the updated DataFrame
print(Int_df_merged[['Initial Priority', 'Initial Priority Description', 'LATEST_PRIORITY', 'Latest Priority Description']].head())
Initial Priority Initial Priority Description LATEST_PRIORITY \
0 2 Urgent PFI Evolve RD Irvine EMB 2
1 1 Emergency 1
2 1 Emergency 1
3 2 Urgent PFI Evolve RD Irvine EMB 2
4 1 Emergency 1
Latest Priority Description
0 Urgent PFI Evolve RD Irvine EMB
1 Emergency
2 Emergency
3 Urgent PFI Evolve RD Irvine EMB
4 Emergency
# Check unique values for 'ABANDON_REASON_CODE' in the master file 'Int_df'
unique_abandon_reason_codes = Int_df_merged['ABANDON_REASON_CODE'].unique()
# Display the unique codes
print(unique_abandon_reason_codes)
# Check unique values for 'ABANDON_REASON_CODE' in the master file 'Int_df'
unique_abandon_reason_codes = Int_df_merged['ABANDON_REASON_DESC'].unique()
# Display the unique codes
print(unique_abandon_reason_codes)
# Count of unique codes
unique_codes_count = Int_df_merged['ABANDON_REASON_CODE'].nunique(dropna=False)
# Count of unique descriptions
unique_descriptions_count = Int_df_merged['ABANDON_REASON_DESC'].nunique(dropna=False)
print(f"Count of unique codes: {unique_codes_count}")
print(f"Count of unique descriptions: {unique_descriptions_count}")
# Filter rows with valid codes (excluding '')
valid_codes = Int_df_merged['ABANDON_REASON_CODE'].isin(['IE', 'DUP', 'NWK', 'TM', 'NA', 'WCC', 'AJ', 'IN', 'AP', 'TR', 'AB', 'TEST', 'DCU', 'NAP', 'SEE', 'WG', 'WD', 'NC', 'CA'])
# Check if these rows have any nan descriptions
nan_descriptions_valid_codes = Int_df_merged.loc[valid_codes, 'ABANDON_REASON_DESC'].isna().any()
print(f"Valid codes have any nan descriptions: {nan_descriptions_valid_codes}")
# Filter rows with empty string codes
empty_string_codes = Int_df_merged['ABANDON_REASON_CODE'] == ''
# Check if these rows have nan descriptions
nan_descriptions = Int_df_merged.loc[empty_string_codes, 'ABANDON_REASON_DESC'].isna().all()
print(f"Empty string codes have nan descriptions: {nan_descriptions}")
['' 'IE' 'DUP' 'NWK' 'TM' 'NA' 'WCC' 'AJ' 'IN' 'AP' 'TR' 'AB' 'TEST' 'DCU'
 'NAP' 'SEE' 'WG' 'WD' 'NC' 'CA']
[nan 'Input Error' 'Duplicate Order' 'No Work Required' 'Tenant Missed Appt'
 'No Access' 'Wrong Contractor' 'Alternative Job' 'Inspection Not Required'
 'Added to Planned Programme' 'Tenant Refusal' 'Abortive Call' 'Testing'
 'Data Clean Up' 'Riverside Not Approved' 'See Repair Memo'
 'Work Under Guarantee' 'Work Deferred' 'No Charge' 'Contractor Link Reason']
Count of unique codes: 20
Count of unique descriptions: 20
Valid codes have any nan descriptions: False
Empty string codes have nan descriptions: True
# Assuming 'Initial Priority' and 'Initial Priority Description' are the column names
unique_priorities = Int_df_merged[['Initial Priority', 'Initial Priority Description']].drop_duplicates()
# Sort the DataFrame by 'Initial Priority'
unique_priorities = unique_priorities.sort_values(by='Initial Priority')
# Print the unique codes and descriptions
print(unique_priorities)
# Assuming 'Initial Priority' and 'Initial Priority Description' are the column names
unique_priorities = Int_df_merged[['LATEST_PRIORITY', 'Latest Priority Description']].drop_duplicates()
# Sort the DataFrame by 'Initial Priority'
unique_priorities = unique_priorities.sort_values(by='LATEST_PRIORITY')
# Print the unique codes and descriptions
print(unique_priorities)
Initial Priority Initial Priority Description
794
416 0 Emergency Health and Safety
1 1 Emergency
4439 1
879 1 Emergency - 12 Calendar Hours
264 11
115 11 Pre Inspection 5 Working Days
674 13 Damp and Mould Inspection
146 13
5907 14 Major Responsive Repairs
17 15 Urgent GAS Evolve RD Irvine EMB
878 15 Urgent GAS - 3 Working Days
145 16 Damp and Mould Follow-On Work
0 2 Urgent PFI Evolve RD Irvine EMB
30 2
897 20 Health & Safety - Compliance - 4 Hours
4648 21 12 Calendar Hours
36 21 Emergency - Compliance - 12 Hours
161 22 3 Working Days - Compliance
383 23 Urgent - Compliance - 7 Calendar Days
209 24 7 Working Days - Compliance
726 25 10 Working Days - Compliance
527 26 28 Calendar Days - Compliance
776 27 38 Calendar Days - Compliance
7725 28 28 Calendar Days - Compliance
1122 28 56 Calendar Days - Compliance
940 29 76 Calendar Days - Compliance
9 3 Appointable
21 3
877 3 Appointable - 20 Working Days
4542 30 112 Calendar Days - Compliance
2300 31 335 Calendar Days - Compliance
16857 32 700 Calendar Days - Compliance
90 5 Discretionary
702 56 Section 11 Works
37 7 Three Day Void
38 9 Two Week Void
318 9
LATEST_PRIORITY Latest Priority Description
887
416 0 Emergency Health and Safety
1 1 Emergency
115 11 Pre Inspection 5 Working Days
146 13
5907 14 Major Responsive Repairs
17 15 Urgent GAS Evolve RD Irvine EMB
145 16 Damp and Mould Follow-On Work
0 2 Urgent PFI Evolve RD Irvine EMB
897 20 Health & Safety - Compliance - 4 Hours
36 21 Emergency - Compliance - 12 Hours
161 22 3 Working Days - Compliance
383 23 Urgent - Compliance - 7 Calendar Days
209 24 7 Working Days - Compliance
726 25 10 Working Days - Compliance
527 26 28 Calendar Days - Compliance
776 27 38 Calendar Days - Compliance
1122 28 56 Calendar Days - Compliance
940 29 76 Calendar Days - Compliance
9 3 Appointable
4542 30 112 Calendar Days - Compliance
2300 31 335 Calendar Days - Compliance
16857 32 700 Calendar Days - Compliance
90 5 Discretionary
702 56 Section 11 Works
37 7 Three Day Void
38 9 Two Week Void
Input data analysis (including missing values and data mappings)
These are the initial observations from all 24 variables in the merged dataframe.
1. Some initial priority codes have no description, but every record with a missing initial priority code also has a missing description, as expected.
2. One initial priority code can be mapped to multiple descriptions.
3. Contractor has 'N/A' codes; to be investigated later.
4. Some property type codes (count = 14) have zero codes.
5. 199 latest priority code values are missing; since latest priority codes have no descriptions of their own, descriptions were mapped from the initial priority descriptions via a lookup on the initial priority code.
6. 17,253 "Abandon reason codes" are missing; correct descriptions were mapped from the lookup file, so the 17,253 corresponding descriptions are also missing.
7. SOR_CODE and SOR_DESCRIPTION each have 207 missing values.
8. TRADE_DESCRIPTION has 198 missing values.
9. Repair completion date ("Date Comp") has 790 missing values.
10. "Day of Date Logged" and "Date Logged" are the same field: field redundancy with no missing values.
11. The completion date is missing for 790 records, while the logging date is present for all repair tasks.
12. Total repair value is present for every job.
13. Number of zero values in 'Total Value': 7624.
14. Number of near-zero values in 'Total Value': 7624.
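The field-redundancy observation ("Day of Date Logged" versus "Date Logged") can be verified directly. A minimal sketch, using a toy two-row frame standing in for the full dataframe (the column names match the notebook; the data here are illustrative):

```python
import pandas as pd

# Hypothetical miniature frame; both columns are parsed as datetimes,
# as in the notebook's read_csv call.
df = pd.DataFrame({
    "Day of Date Logged": pd.to_datetime(["2022-09-02", "2022-08-26"]),
    "Date Logged": pd.to_datetime(["2022-09-02", "2022-08-26"]),
})

# Series.equals is an exact element-wise comparison (NaT-safe);
# if it returns True, one of the two columns can be dropped as redundant.
redundant = df["Day of Date Logged"].equals(df["Date Logged"])
if redundant:
    df = df.drop(columns=["Day of Date Logged"])
```

Running the same check on the full dataframe confirms whether the redundancy holds for every record, not just the head shown earlier.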
# Count the number of zero values
zero_values_count = (Int_df_merged['Total Value'] == 0).sum()
# Define a threshold for near-zero values
near_zero_threshold = 1 # You can adjust this threshold as needed
# Count the number of near-zero values
near_zero_values_count = (Int_df_merged['Total Value'].abs() < near_zero_threshold).sum()
# Print the counts
print(f"Number of zero values in 'Total Value': {zero_values_count}")
print(f"Number of near-zero values in 'Total Value': {near_zero_values_count}")
Number of zero values in 'Total Value': 7624
Number of near-zero values in 'Total Value': 7624
# Check missing value count for each column
# missing_values = Int_df_merged.isnull().sum()
missing_values = Int_df_merged.apply(lambda x: (x == '') | pd.isnull(x)).sum()
# Display the missing value count for each column
print("Missing Value Count for Each Column:")
print(missing_values)
# Plotting the horizontal bar graph
fig, ax = plt.subplots(figsize=(12, 8)) # Adjust width and height as needed
bars = ax.barh(missing_values.index, missing_values.values)
# Adding the missing value counts on top of the bars
for bar in bars:
    yval = bar.get_width()
    plt.text(yval, bar.get_y() + bar.get_height()/2, round(yval, 2), ha='left', va='center', weight='bold')
# Increasing the y-axis tick labels text size and bold
plt.yticks(fontsize=14) # Adjust fontsize and weight as needed
plt.xlabel('Number of Missing Values', fontsize=16, fontweight = 'bold')
plt.title('Missing Values in Each Column', fontsize = 16, fontweight = 'bold')
plt.show()
Missing Value Count for Each Column:
Job No                              0
Job Type                            0
JOB_TYPE_DESCRIPTION                0
CONTRACTOR                          0
Year of Build Date                  0
Jobsourcedescription                0
Property Ref                        0
Property Type                       0
Initial Priority                  217
Initial Priority Description     3982
Job Status                          0
LATEST_PRIORITY                   199
ABANDON_REASON_CODE             17253
Day of Date Logged                  0
SOR_CODE                          207
SOR_DESCRIPTION                   207
Date Logged                         0
Mgt Area                            0
TRADE_DESCRIPTION                 198
Date Comp                         790
Total Value                         0
ABANDON_REASON_DESC             17253
JOB_STATUS_DESCRIPTION              0
Latest Priority Description       691
dtype: int64
To understand whether the data are missing for any of the reasons below
- MCAR: Missingness has no pattern and is unrelated to other variables.
- MAR: Missingness has a pattern and is related to other observed variables.
- MNAR: Missingness is related to unobserved data or the missing data itself.
missing_indicator_df = Int_df_merged.replace('', pd.NA).isna()
# Get all column names that have missing data
columns_with_missing_data = Int_df_merged.columns[Int_df_merged.replace('', pd.NA).isna().any()].tolist()
# List to store test results
test_results = []
# Iterate over all combinations of these columns
for col1, col2 in combinations(columns_with_missing_data, 2):
    # Create a contingency table
    contingency_table = pd.crosstab(missing_indicator_df[col1], missing_indicator_df[col2])
    # Perform the chi-squared test
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    # Store the results
    test_results.append([col1, col2, chi2, p, dof])
# Convert results to DataFrame
test_results_df = pd.DataFrame(test_results, columns=['Var1', 'Var2', 'Chi2', 'P-Value', 'DoF'])
# Adjust the p-values for multiple comparisons using Bonferroni correction
adjusted_p = multipletests(test_results_df['P-Value'], method='bonferroni')
test_results_df['Adjusted P-Value'] = adjusted_p[1]
# Add 'Missingness_Type' column based on adjusted p-value
threshold = 0.05
test_results_df['Missingness_Type'] = test_results_df['Adjusted P-Value'].apply(lambda x: 'MAR or MNAR' if x < threshold else 'MCAR')
# Display the DataFrame with adjusted p-values and missingness type
print(test_results_df)
Var1 Var2 Chi2 \
0 Initial Priority Initial Priority Description 947.304748
1 Initial Priority LATEST_PRIORITY 19404.786129
2 Initial Priority ABANDON_REASON_CODE 2.802396
3 Initial Priority SOR_CODE 19028.150709
4 Initial Priority SOR_DESCRIPTION 19028.150709
5 Initial Priority TRADE_DESCRIPTION 19305.862796
6 Initial Priority Date Comp 223.797323
7 Initial Priority ABANDON_REASON_DESC 2.802396
8 Initial Priority Latest Priority Description 5433.439572
9 Initial Priority Description LATEST_PRIORITY 867.538655
10 Initial Priority Description ABANDON_REASON_CODE 22.195374
11 Initial Priority Description SOR_CODE 839.544122
12 Initial Priority Description SOR_DESCRIPTION 839.544122
13 Initial Priority Description TRADE_DESCRIPTION 863.111201
14 Initial Priority Description Date Comp 33.998717
15 Initial Priority Description ABANDON_REASON_DESC 22.195374
16 Initial Priority Description Latest Priority Description 1703.998045
17 LATEST_PRIORITY ABANDON_REASON_CODE 16.284831
18 LATEST_PRIORITY SOR_CODE 20351.919598
19 LATEST_PRIORITY SOR_DESCRIPTION 20351.919598
20 LATEST_PRIORITY TRADE_DESCRIPTION 20641.625433
21 LATEST_PRIORITY Date Comp 239.941788
22 LATEST_PRIORITY ABANDON_REASON_DESC 16.284831
23 LATEST_PRIORITY Latest Priority Description 5956.039152
24 ABANDON_REASON_CODE SOR_CODE 12.352946
25 ABANDON_REASON_CODE SOR_DESCRIPTION 12.352946
26 ABANDON_REASON_CODE TRADE_DESCRIPTION 16.088185
27 ABANDON_REASON_CODE Date Comp 190.506271
28 ABANDON_REASON_CODE ABANDON_REASON_DESC 21279.488785
29 ABANDON_REASON_CODE Latest Priority Description 51.084482
30 SOR_CODE SOR_DESCRIPTION 21182.285905
31 SOR_CODE TRADE_DESCRIPTION 20248.168137
32 SOR_CODE Date Comp 238.715959
33 SOR_CODE ABANDON_REASON_DESC 12.352946
34 SOR_CODE Latest Priority Description 5712.545285
35 SOR_DESCRIPTION TRADE_DESCRIPTION 20248.168137
36 SOR_DESCRIPTION Date Comp 238.715959
37 SOR_DESCRIPTION ABANDON_REASON_DESC 12.352946
38 SOR_DESCRIPTION Latest Priority Description 5712.545285
39 TRADE_DESCRIPTION Date Comp 229.979491
40 TRADE_DESCRIPTION ABANDON_REASON_DESC 16.088185
41 TRADE_DESCRIPTION Latest Priority Description 5802.270847
42 Date Comp ABANDON_REASON_DESC 190.506271
43 Date Comp Latest Priority Description 266.897787
44 ABANDON_REASON_DESC Latest Priority Description 51.084482
P-Value DoF Adjusted P-Value Missingness_Type
0 5.112410e-208 1 2.300584e-206 MAR or MNAR
1 0.000000e+00 1 0.000000e+00 MAR or MNAR
2 9.412354e-02 1 1.000000e+00 MCAR
3 0.000000e+00 1 0.000000e+00 MAR or MNAR
4 0.000000e+00 1 0.000000e+00 MAR or MNAR
5 0.000000e+00 1 0.000000e+00 MAR or MNAR
6 1.343142e-50 1 6.044141e-49 MAR or MNAR
7 9.412354e-02 1 1.000000e+00 MCAR
8 0.000000e+00 1 0.000000e+00 MAR or MNAR
9 1.118591e-190 1 5.033658e-189 MAR or MNAR
10 2.462664e-06 1 1.108199e-04 MAR or MNAR
11 1.363681e-184 1 6.136564e-183 MAR or MNAR
12 1.363681e-184 1 6.136564e-183 MAR or MNAR
13 1.026098e-189 1 4.617442e-188 MAR or MNAR
14 5.514844e-09 1 2.481680e-07 MAR or MNAR
15 2.462664e-06 1 1.108199e-04 MAR or MNAR
16 0.000000e+00 1 0.000000e+00 MAR or MNAR
17 5.449848e-05 1 2.452432e-03 MAR or MNAR
18 0.000000e+00 1 0.000000e+00 MAR or MNAR
19 0.000000e+00 1 0.000000e+00 MAR or MNAR
20 0.000000e+00 1 0.000000e+00 MAR or MNAR
21 4.049471e-54 1 1.822262e-52 MAR or MNAR
22 5.449848e-05 1 2.452432e-03 MAR or MNAR
23 0.000000e+00 1 0.000000e+00 MAR or MNAR
24 4.402911e-04 1 1.981310e-02 MAR or MNAR
25 4.402911e-04 1 1.981310e-02 MAR or MNAR
26 6.046007e-05 1 2.720703e-03 MAR or MNAR
27 2.465039e-43 1 1.109267e-41 MAR or MNAR
28 0.000000e+00 1 0.000000e+00 MAR or MNAR
29 8.847495e-13 1 3.981373e-11 MAR or MNAR
30 0.000000e+00 1 0.000000e+00 MAR or MNAR
31 0.000000e+00 1 0.000000e+00 MAR or MNAR
32 7.493537e-54 1 3.372092e-52 MAR or MNAR
33 4.402911e-04 1 1.981310e-02 MAR or MNAR
34 0.000000e+00 1 0.000000e+00 MAR or MNAR
35 0.000000e+00 1 0.000000e+00 MAR or MNAR
36 7.493537e-54 1 3.372092e-52 MAR or MNAR
37 4.402911e-04 1 1.981310e-02 MAR or MNAR
38 0.000000e+00 1 0.000000e+00 MAR or MNAR
39 6.023032e-52 1 2.710364e-50 MAR or MNAR
40 6.046007e-05 1 2.720703e-03 MAR or MNAR
41 0.000000e+00 1 0.000000e+00 MAR or MNAR
42 2.465039e-43 1 1.109267e-41 MAR or MNAR
43 5.383159e-60 1 2.422422e-58 MAR or MNAR
44 8.847495e-13 1 3.981373e-11 MAR or MNAR
# Create a pivot table for the heatmap
pivot_table = test_results_df.pivot(index='Var1', columns='Var2', values='Missingness_Type')
# Create a color map for the missingness types
cmap = ListedColormap(['green', 'red']) # Green for MCAR, red for MAR or MNAR
missingness_types = {'MCAR': 0, 'MAR or MNAR': 1}
# Replace the missingness types with integers for color mapping
pivot_table = pivot_table.replace(missingness_types)
# Plot the heatmap
plt.figure(figsize=(8, 6))
ax = sns.heatmap(pivot_table, cmap=cmap, annot=True)
plt.title('Heatmap of Missingness Type', fontweight="bold")
# Create a custom legend
legend_labels = [Patch(facecolor='green', label='MCAR'),
Patch(facecolor='red', label='MAR or MNAR')]
plt.legend(handles=legend_labels, bbox_to_anchor=(1.05, 1.4), loc='upper left', title='Missingness Type')
plt.tight_layout()
plt.show()
Implications for data imputation based on the nature of missing data:
The approach to data imputation significantly depends on the nature of the missing data, which can be classified as MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random).
MCAR (Missing Completely at Random):
Definition: The missingness of data is independent of both observed and unobserved data. The reasons for missing data are completely random and do not relate to the data.
Implication for Imputation:
Simple imputation methods can be effectively used since the missing data represents a random sample of the complete data. Methods include mean/mode/median imputation, random sampling from observed values, or more sophisticated techniques like K-Nearest Neighbors (KNN) imputation.
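The simple methods listed above can be sketched with scikit-learn in a few lines. This is a minimal illustration on a toy frame, not the notebook's data; the `cost` column and values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy frame with values missing completely at random (illustrative only)
toy = pd.DataFrame({'cost': [100.0, np.nan, 88.45, 36.63, np.nan, 120.0]})

# Median imputation: robust to the right skew typical of cost variables
toy['cost_median'] = SimpleImputer(strategy='median').fit_transform(toy[['cost']]).ravel()

# KNN imputation: fills each gap from the k most similar complete rows
toy['cost_knn'] = KNNImputer(n_neighbors=2).fit_transform(toy[['cost']]).ravel()
```

Median is usually preferred over mean for skewed cost data, since a few very expensive jobs would otherwise pull the imputed value upwards.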
MAR (Missing at Random):
Definition: The missingness of data is related to the observed data but not to the missing data itself. The cause of the missing data is related to variables for which you have data.
Implication for Imputation:
Requires methods that model the probability of missingness based on observed data. Techniques include regression imputation, where missing values are predicted based on observed data, or more advanced methods like Multiple Imputation by Chained Equations (MICE), which iteratively models each variable with missing data conditional on the other variables.
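A MICE-style imputation of this kind can be sketched with scikit-learn's `IterativeImputer`. The example below is illustrative only (the `hours`/`cost` columns are invented): the missing column is predictable from an observed one, which is exactly the MAR situation where this approach shines.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy MAR setup: 'cost' is missing whenever the *observed* 'hours' exceeds 8,
# so the missingness depends only on observed data (illustrative example)
rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, 200)
cost = 50 * hours + rng.normal(0, 5, 200)
mask = hours > 8
cost[mask] = np.nan

X = pd.DataFrame({'hours': hours, 'cost': cost})
# IterativeImputer models each column with missing values conditionally
# on the others, iterating until convergence (a MICE-style scheme)
mice = IterativeImputer(random_state=0, max_iter=10)
X_imp = pd.DataFrame(mice.fit_transform(X), columns=X.columns)
```

Because the imputer regresses `cost` on `hours`, the filled-in values track the true linear relationship rather than collapsing to a single column-wide constant.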
MNAR (Missing Not at Random):
Definition: The missingness of data is related to the unobserved data, i.e., the reason for the missing data is related to the values that are missing.
Implication for Imputation:
The most complex case for imputation, as it requires understanding the mechanism behind the missing data. It often requires external information or strong assumptions about the missing-data mechanism.
Techniques include model-based methods, where the model includes terms for the missing-data mechanism, or sensitivity analysis to understand how different assumptions about the missing data affect the results.
Grouping MAR and MNAR: In practice, distinguishing between MAR and MNAR without external information is often difficult; both require more sophisticated approaches than MCAR, so they are grouped together here.
Implication for Analysis:
Indicates that the missing data cannot be assumed to be a random sample of the complete data, and that the mechanism behind the missingness needs to be considered during imputation.
Approach: Use advanced imputation methods that incorporate models of the missingness mechanism or utilise additional information about the data.
Conclusion:
The choice of imputation method should be guided by the nature of the missing data. While MCAR allows for a wide range of imputation methods, MAR and MNAR require more careful consideration and sophisticated techniques to ensure that the imputation does not introduce significant biases or inaccuracies into the data analysis.
Task Completion Time (Missing Data Analysis)
The output of the proportions of missing 'Task_completion_time' across different 'JOB_TYPE_DESCRIPTION' categories reveals a pattern in the missingness that does not appear to be random. Here are some key observations and interpretations:
High Proportion of Missing Data in Specific Categories:
Categories like 'Tenant Doing Own Repair' and 'Play Equipment Inspections' have a 100% missing rate, which is very significant: for certain job types, the completion time is consistently not recorded. 'Pre-Inspection', 'Play Equipment Repairs', and 'Asbestos Repairs Communal' also have high proportions of missing data.
Variability in Missing Data Proportions:
There is a wide range in the proportion of missing data across job types. While some have very high missing rates, others, such as 'Responsive Repairs', 'Gas Responsive Repairs', and various inspection and repair categories, have much lower rates. This variability indicates that the missingness is related to the type of job, a pattern suggesting it is not random.
Potential Reasons for Missing Data:
The pattern might be due to operational reasons. For example, certain job types may inherently take longer to complete or have more uncertainty in completion times, leading to more frequent missing entries. There could also be administrative practices or data recording protocols that vary by job type, affecting how often 'Task_completion_time' is recorded.
Implications for Analysis:
This non-random pattern of missing data suggests that simply excluding these records, or imputing them without considering the job type, might introduce bias into the analysis. Understanding why certain job types have higher rates of missing 'Task_completion_time' is important: it informs how these missing values should be handled and may also reveal operational aspects that could be optimised.
Next Steps:
Investigate the operational or data-recording processes for job types with high missingness to understand the reasons behind it. Consider a stratified approach in which missing data are handled differently for different job types, or focus the analysis on job types with more reliable data.
In summary, the distribution of missing data in 'Task_completion_time' across 'JOB_TYPE_DESCRIPTION' suggests a non-random pattern, likely influenced by the nature of the jobs and the data recording practices associated with them. This pattern should be taken into account in the data handling and analysis strategies.
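The per-category missing proportions discussed above reduce to a single groupby over a missingness indicator. A minimal sketch on toy data (the notebook's column names `Task_completion_time` and `JOB_TYPE_DESCRIPTION` are assumed; the rows are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'JOB_TYPE_DESCRIPTION': ['Responsive Repairs'] * 4 + ['Tenant Doing Own Repair'] * 2,
    'Task_completion_time': [5.0, np.nan, 3.0, 4.0, np.nan, np.nan],
})

# Proportion of missing completion times within each job type:
# the mean of a boolean missingness indicator is exactly that proportion
missing_by_job = (
    df['Task_completion_time']
    .isna()
    .groupby(df['JOB_TYPE_DESCRIPTION'])
    .mean()
    .sort_values(ascending=False)
)
print(missing_by_job)
```

A value of 1.0 for a job type corresponds to the 100% missing rate observed for categories like 'Tenant Doing Own Repair'.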
# Check missing value count for each column
# missing_values = Int_df.apply(lambda x: (x == '') | pd.isnull(x)).sum()
missing_values = Int_df_merged.apply(lambda x: (x == '') | pd.isnull(x)).sum()
# # Display the missing value count for each column
# print("Missing Value Count for Each Column:")
# print(missing_values)
# Choose a different color for the bars (e.g., 'salmon')
bar_color = 'salmon'
# Create a horizontal bar chart
fig = go.Figure(go.Bar(
y=missing_values.index,
x=missing_values.values,
orientation='h', # Horizontal orientation
text=missing_values.values, # Display the values on the bars
textposition='auto', # Automatically position the text
marker_color=bar_color # Set the bar color
))
# Customize the layout
fig.update_layout(
xaxis_title='Number of Missing Values',
yaxis_title='Columns',
height=600, # Adjust height as needed
title=dict(
text='<b>Missing Values in Each Column </b>', # Use HTML tags for bold text
x=0.5, # Set x=0.5 to center the title
font=dict(
size=18, # Use the desired font size
color='black', # Set the font color
family='Arial', # Set the font family
)
)
)
# Show the plot
fig.show()
# Create a dictionary to store counts for each column
unique_counts = {}
# Iterate over columns and get unique value counts
for column in Int_df_merged.columns:
unique_counts[column] = Int_df_merged[column].nunique()
# Convert the dictionary to a DataFrame for better display
unique_counts_df = pd.DataFrame(list(unique_counts.items()), columns=['Column', 'Unique Counts'])
# Print or display the DataFrame
print(unique_counts_df)
# Create a bar plot using Plotly Express
bar_plot = px.bar(
unique_counts_df,
x='Column',
y='Unique Counts',
text='Unique Counts', # Display counts on top of the bars
labels={'Unique Counts': 'Count'},
title='Unique Value Counts for Each Column',
height=600, width=800
)
# Update layout for better readability
bar_plot.update_layout(
xaxis_title='Column',
yaxis_title='Count',
title=dict(text='<b> Unique Value Counts for Each Column</b>', x=0.5, y=0.95, font=dict(size=16, family='Arial')),
xaxis=dict(tickangle=45, tickmode='array'),
yaxis=dict(showgrid=True),
)
# Show the plot
bar_plot.show()
Column Unique Counts
0 Job No 21286
1 Job Type 44
2 JOB_TYPE_DESCRIPTION 44
3 CONTRACTOR 33
4 Year of Build Date 36
5 Jobsourcedescription 15
6 Property Ref 2078
7 Property Type 10
8 Initial Priority 27
9 Initial Priority Description 31
10 Job Status 6
11 LATEST_PRIORITY 27
12 ABANDON_REASON_CODE 20
13 Day of Date Logged 548
14 SOR_CODE 1073
15 SOR_DESCRIPTION 1063
16 Date Logged 548
17 Mgt Area 3
18 TRADE_DESCRIPTION 31
19 Date Comp 544
20 Total Value 3306
21 ABANDON_REASON_DESC 19
22 JOB_STATUS_DESCRIPTION 6
23 Latest Priority Description 26
# Assuming 'Year of Build Date' and 'Total Value' are columns in Int_df_merged
numeric_variables = ['Year of Build Date', 'Total Value']
# Create a new DataFrame with only the selected numeric variables
numeric_repair_df = Int_df_merged[numeric_variables].copy()
# numeric_repair_df['Year of Build Date'] = numeric_repair_df['Year of Build Date'].astype('category')
# Display the new DataFrame
print(numeric_repair_df.head())
# Check unique values in each column
for column in numeric_repair_df.columns:
unique_values = numeric_repair_df[column].unique()
unique_count = numeric_repair_df[column].nunique()
print(f"Number of unique values in {column}: {unique_count}")
Year of Build Date Total Value
0 2021 100.00
1 2021 267.12
2 2021 88.45
3 2021 36.63
4 2021 100.00
Number of unique values in Year of Build Date: 36
Number of unique values in Total Value: 3306
# Separate data into categorical and numeric
categorical_data = Int_df_merged.select_dtypes(include='object')
numeric_data = Int_df_merged.select_dtypes(include=['int64', 'float64'])
# Get variable names
categorical_variables = list(categorical_data.columns)
numeric_variables = list(numeric_data.columns)
# Plot categorical and numeric variables side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
# Plot categorical variables with adjusted y-axis labels and spacing
axes[0].barh(range(len(categorical_variables)), [1] * len(categorical_variables), color='skyblue')
axes[0].set_title('Categorical Variables')
# Adjust y-axis labels and spacing
axes[0].set_yticks(range(len(categorical_variables)))
axes[0].set_yticklabels(categorical_variables, fontsize=8, ha='right')
# Plot numeric variables
axes[1].barh(range(len(numeric_variables)), [1] * len(numeric_variables), color='lightcoral')
axes[1].set_title('Numeric Variables')
axes[1].set_yticks(range(len(numeric_variables)))
axes[1].set_yticklabels(numeric_variables, fontsize=8, ha='right')
plt.tight_layout()
plt.show()
# # List of categorical columns for visualization
categorical_columns = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription','Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC','SOR_DESCRIPTION']
#Master dataframe backup
int_df_bk = Int_df_merged.copy()
# int_df_bk.info()
Univariate Analysis - Exploratory Data Analysis
Count Plot (User-Customised Selection)
categorical_columns = [
'JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION'
]
# Initialize the Dash app
app = dash.Dash(__name__)
# Layout of the app
app.layout = html.Div([
html.Label("Select a categorical column:"),
dcc.Dropdown(
id='column-dropdown',
options=[{'label': col, 'value': col} for col in categorical_columns],
value=categorical_columns[0], # Initial selected column
),
dcc.Graph(id='count-plot')
])
# Callback to update the count plot based on the selected dropdown column
@app.callback(
Output('count-plot', 'figure'),
[Input('column-dropdown', 'value')]
)
def update_count_plot(selected_column):
# Calculate value counts and percentages
value_counts = Int_df_merged[selected_column].value_counts()
percentages = (value_counts / len(Int_df_merged)) * 100
# Create a bar plot using Plotly Express with color differentiation
count_plot = px.bar(
x=value_counts.index,
y=value_counts.values,
color=value_counts.index, # Assign different colors based on the index values
text=value_counts.values,
labels={'y': 'Count', 'x': selected_column},
title=f'Count Plot for {selected_column}',
height=500, width=1000,
)
# Add custom hover information using a dictionary
hover_template = (
f"<b>{selected_column}:</b> %{{x}}<br>"
"<b>Count:</b> %{y}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
count_plot.update_traces(
hovertemplate=hover_template,
customdata=percentages.values,
)
# Update layout for title and axis labels
count_plot.update_layout(
title=dict(text=f'<b>Count Plot for {selected_column}</b>', x=0.5, y=0.95, font=dict(family='Arial')), # Center-align title and make it bold
xaxis_title=dict(text=f'<b>{selected_column}</b>', font=dict(family='Arial')), # Set x-axis label dynamically
yaxis_title=dict(text='<b>Count</b>', font=dict(family='Arial')), # Make y-axis label bold
)
return count_plot
# Run the app
# if __name__ == '__main__':
# app.run_server()
# 1- Count Plot for Categorical Columns:
Count Plot (Individual Plots)
# # common size for all plots
# common_height = 600
# common_width = 1000
# # Create subplots for each categorical variable
# for column in categorical_columns:
# # Create a DataFrame for the current categorical variable
# hist_df = pd.DataFrame(Int_df_merged[column].value_counts()).reset_index()
# hist_df.columns = [column, 'Count']
# # Calculate percentages and round to two decimal places
# hist_df['Percentage'] = (hist_df['Count'] / hist_df['Count'].sum()) * 100
# hist_df['Percentage'] = hist_df['Percentage'].round(2)
# # Define custom colors for the bars
# colors = px.colors.qualitative.Set1
# # Create a count plot with hover-over display and custom colors
# fig = px.bar(hist_df, x=column, y='Count', text='Count', color=column,
# title=f'<b>Count Plot for {column}</b>', color_discrete_sequence=colors)
# # Customize the layout
# fig.update_layout(
# xaxis_title=column,
# yaxis_title='Count',
# height=common_height,
# width=common_width,
# title_x=0.5, # Center-align the title
# )
# # Update hover template to show both 'Count' and 'Percentage'
# hover_template = f'<b>{column}</b>: %{{x}}<br><b>Count</b>: %{{y}}<br><b>Percentage</b>: %{{text}}'
# fig.update_traces(text=hist_df['Percentage'].astype(str) + '%', hovertemplate=hover_template)
# # Show the plot
# fig.show()
# common size for all plots
common_height = 600
common_width = 1000
# Create subplots for each categorical variable
for column in categorical_columns:
# Create a DataFrame for the current categorical variable
hist_df = pd.DataFrame(Int_df_merged[column].value_counts()).reset_index()
hist_df.columns = [column, 'Count']
# Calculate percentages and round to two decimal places
hist_df['Percentage'] = (hist_df['Count'] / hist_df['Count'].sum()) * 100
hist_df['Percentage'] = hist_df['Percentage'].round(2)
# Define custom colors for the bars
colors = px.colors.qualitative.Set1
# Create a count plot with hover-over display and custom colors
# Set the 'text' parameter to the 'Percentage' column of hist_df
fig = px.bar(hist_df, x=column, y='Count', text='Percentage', color=column,
title=f'<b>Count Plot for {column}</b>', color_discrete_sequence=colors)
# Customize the layout
fig.update_layout(
xaxis_title=column,
yaxis_title='Count',
height=common_height,
width=common_width,
title_x=0.5, # Center-align the title
)
# Update hover template to show both 'Count' and 'Percentage'
hover_template = f'<b>{column}</b>: %{{x}}<br><b>Count</b>: %{{y}}<br><b>Percentage</b>: %{{text}}%'
fig.update_traces(hovertemplate=hover_template, textposition='outside')
# Show the plot
fig.show()
Univariate Analysis Summarized Findings
JOB_TYPE_DESCRIPTION:
"Responsive Repairs" (63.29%) is the most common job type, followed by "Gas Responsive Repairs" (19.63%).
Contractor:
A disproportionately large share (approx. 90%) of records have no contractor named; these appear to be either deliberately anonymised or data entry errors.
Property Type:
Three property types ("Terrace", "End Terrace", and "Access direct") account for the bulk of housing repair requests, in that order (approx. 34.58%, 25.3%, and 15.66% respectively).
Jobsourcedescription:
A high volume of repair requests originate from "CSC Phone call" (60.81%), followed by the website (15.75%).
Initial Priority Description:
Though not very pronounced, a plurality of initial request priorities are "Appointable" (39.36%), followed by blank/missing priority descriptions (18.71%) and Emergency requests (16.77%).
Latest Priority Description:
In contrast to the initial priorities, the latest priorities shift further towards "Appointable" (53.2%), followed by Emergency (17.11%) and Urgent PFI Evolve RD Irvine EMB (10.78%), with very few missing latest priority descriptions (199 records).
Job Status Description:
A very significant share of repairs carry the status "Invoice Accepted" (75.57%), followed by "Abandoned" (18.95%).
Trade Description:
The largest, though only moderately dominant, share of repair requests belong to "Gas Repairs" (20.87%), followed by Carpenter (16.31%) and Plumbing (15.48%).
Abandon Reason Description:
Similarly, the largest, though only moderately dominant, share of abandoned repairs cite "No Work Required" (21.77%), followed by "Alternative Job" (20.46%), "No Access" (16.86%), and "Duplicate Order" (10.04%).
# Setting up visualization style
sns.set(style="whitegrid")
# Exploratory Analysis
# 1. Type of Repairs
repair_type_counts = Int_df_merged['JOB_TYPE_DESCRIPTION'].value_counts()
# 2. Contractors
contractor_counts = Int_df_merged['CONTRACTOR'].value_counts()
# 3. Year of Build
year_build_counts = Int_df_merged['Year of Build Date'].value_counts()
# 4. Property Types
property_type_counts = Int_df_merged['Property Type'].value_counts()
# 5. Priority of Jobs
priority_counts = Int_df_merged['Initial Priority Description'].value_counts()
# 6. Cost Analysis
# Converting 'Total Value' to numeric for analysis
Int_df_merged['Total Value'] = pd.to_numeric(Int_df_merged['Total Value'], errors='coerce')
average_cost_per_job_type = Int_df_merged.groupby('JOB_TYPE_DESCRIPTION')['Total Value'].mean()
# Visualizing some of these distributions
# Increase the overall figure size for better visibility
fig, axes = plt.subplots(3, 2, figsize=(20, 20)) # Increased figure size
fig.suptitle('Exploratory Data Analysis of Borsetshire Repair Records')
# Plotting the distributions with adjusted label orientations and spacing
# Types of Repairs
sns.barplot(ax=axes[0, 0], x=repair_type_counts.values, y=repair_type_counts.index)
axes[0, 0].set_title('Types of Repairs')
axes[0, 0].set_xlabel('Count')
axes[0, 0].set_ylabel('Repair Type')
# Contractors
sns.barplot(ax=axes[0, 1], x=contractor_counts.values, y=contractor_counts.index)
axes[0, 1].set_title('Contractors')
axes[0, 1].set_xlabel('Count')
axes[0, 1].set_ylabel('Contractor')
# Year of Build
sns.barplot(ax=axes[1, 0], x=year_build_counts.index, y=year_build_counts.values)
axes[1, 0].set_title('Year of Build')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Count')
for label in axes[1, 0].get_xticklabels():
label.set_rotation(90) # Rotate x-axis labels
# Property Types
sns.barplot(ax=axes[1, 1], x=property_type_counts.values, y=property_type_counts.index)
axes[1, 1].set_title('Property Types')
axes[1, 1].set_xlabel('Count')
axes[1, 1].set_ylabel('Property Type')
# Priority of Jobs
sns.barplot(ax=axes[2, 0], x=priority_counts.values, y=priority_counts.index)
axes[2, 0].set_title('Priority of Jobs')
axes[2, 0].set_xlabel('Count')
axes[2, 0].set_ylabel('Priority')
# Average Cost per Job Type
sns.barplot(ax=axes[2, 1], x=average_cost_per_job_type.values, y=average_cost_per_job_type.index)
axes[2, 1].set_title('Average Cost per Job Type')
axes[2, 1].set_xlabel('Average Cost (Β£)')
axes[2, 1].set_ylabel('Job Type')
# Adjust overall layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# List of categorical columns for visualization
# categorical_columns = ['Job Type', 'Property Type', 'Initial Priority Description', 'Job Status', 'TRADE_DESCRIPTION']
# Create a vertical layout with len(categorical_columns) subplots
fig, axes = plt.subplots(len(categorical_columns), 1, figsize=(20, 10 * len(categorical_columns)))
# Iterate over categorical columns and create a horizontal bar plot for each
for i, column in enumerate(categorical_columns):
value_counts = Int_df_merged[column].value_counts()
# Calculate percentage values
percentages = (value_counts / len(Int_df_merged)) * 100
# Plot horizontal bar plot with percentage values
value_counts.plot(kind='barh', ax=axes[i], color='skyblue')
axes[i].set_title(f'Bar Plot for {column}')
axes[i].set_xlabel('Count')
# Display percentage values on the bars
for index, value in enumerate(value_counts):
axes[i].text(value, index, f'{percentages.iloc[index]:.2f}%', va='center')  # positional access, robust to label indexes
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
Count and percentage-wise distribution of values inside a categorical variable
# categorical_columns = ['JOB_TYPE_DESCRIPTION', 'Property Type', 'Initial Priority Description', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC']
# Create a dictionary to store DataFrames for each categorical variable
dataframes_dict = {}
# Iterate over each categorical variable
for column in categorical_columns:
# Create a DataFrame for the current categorical variable
value_counts_df = pd.DataFrame(Int_df_merged[column].value_counts()).reset_index()
value_counts_df.columns = [column, 'Count']
# Calculate percentages and add the percentage column
value_counts_df['Percentage'] = (value_counts_df['Count'] / value_counts_df['Count'].sum()) * 100
value_counts_df['Percentage'] = value_counts_df['Percentage'].round(2)
# Add the DataFrame to the dictionary
dataframes_dict[column] = value_counts_df[[column, 'Count', 'Percentage']]
# Access the DataFrames for each categorical variable
job_type_df = dataframes_dict['JOB_TYPE_DESCRIPTION']
property_type_df = dataframes_dict['Property Type']
initial_priority_df = dataframes_dict['Initial Priority Description']
latest_priority_df = dataframes_dict['Latest Priority Description']
job_status_df = dataframes_dict['JOB_STATUS_DESCRIPTION']
trade_df = dataframes_dict['TRADE_DESCRIPTION']
abandon_reason_df = dataframes_dict['ABANDON_REASON_DESC']
# Display the simplified DataFrames
print("Job Type DataFrame:")
print(job_type_df)
print("\nProperty Type DataFrame:")
print(property_type_df)
print("\nInitial Priority DataFrame:")
print(initial_priority_df)
print("\nLatest Priority DataFrame:")
print(latest_priority_df)
print("\nJob Status DataFrame:")
print(job_status_df)
print("\nTrade DataFrame:")
print(trade_df)
print("\nAbandon Reason DataFrame:")
print(abandon_reason_df)
Job Type DataFrame:
JOB_TYPE_DESCRIPTION Count Percentage
0 Responsive Repairs 13471 63.29
1 Gas Responsive Repairs 4179 19.63
2 Suspected Damp 752 3.53
3 Void Repairs 461 2.17
4 Communal Responsive Repairs 276 1.30
5 Fire Safety Equipment Inspections 248 1.17
6 XXXXXXAsbestos Inspections 185 0.87
7 Water Hygiene Inspections 180 0.85
8 Communal Gas Repairs 161 0.76
9 Communal Area Building Safety Inspection 151 0.71
10 Asbestos Inspections Planned 130 0.61
11 Commercial Lifts Inspections 117 0.55
12 XXXXXAsbestos Repairs 111 0.52
13 Asbestos Inspection Reactive 108 0.51
14 Fire Safety Equipment Repairs 98 0.46
15 Rechargeable Repairs 72 0.34
16 Section 11 Repairs 60 0.28
17 Door Access Control Repairs & Service 59 0.28
18 Lifts Consultants 46 0.22
19 Domestic Lifts Inspections 41 0.19
20 Fire Risk Repairs Planned 41 0.19
21 Door Inspection and Repairs 38 0.18
22 Aids and Adaptations 36 0.17
23 Domestic Lifts Repairs 35 0.16
24 Pre-Inspection 33 0.16
25 Fire Risk Repairs 27 0.13
26 Asbestos Repairs Planned 25 0.12
27 Asbestos Inspections Void 24 0.11
28 Warden Call Equipment Repairs 23 0.11
29 PAT Testing 21 0.10
30 Asbestos Repairs Reactive 12 0.06
31 Water Risk Inspection 10 0.05
32 Schedule Repairs Visit 9 0.04
33 Tenant Doing Own Repair 8 0.04
34 Gate and Barrier Repairs 8 0.04
35 Communal Gas Inspections 7 0.03
36 Asbestos Repairs Void 7 0.03
37 Gas Exclusion 4 0.02
38 Play Equipment Repairs 4 0.02
39 Asbestos Repairs Communal 3 0.01
40 Asbestos Inspection Communal 2 0.01
41 Water Hygiene Repairs 1 0.00
42 Play Equipment Inspections 1 0.00
43 Lightning Conductors and Fall Safety Rep 1 0.00
Property Type DataFrame:
Property Type Count Percentage
0 Terrace 7360 34.58
1 End Terrace 5385 25.30
2 Access direct 3333 15.66
3 Access via internal shared area 2298 10.80
4 Semi Detached 2097 9.85
5 Default 473 2.22
6 Detached 167 0.78
7 Block No Shared Area 122 0.57
8 Other Non-Rentable Space 37 0.17
9 0 14 0.07
Initial Priority DataFrame:
Initial Priority Description Count Percentage
0 Appointable 8378 39.36
1 3982 18.71
2 Emergency 3569 16.77
3 Urgent PFI Evolve RD Irvine EMB 1962 9.22
4 28 Calendar Days - Compliance 682 3.20
5 Pre Inspection 5 Working Days 475 2.23
6 7 Working Days - Compliance 346 1.63
7 Damp and Mould Follow-On Work 257 1.21
8 Two Week Void 194 0.91
9 Urgent GAS Evolve RD Irvine EMB 191 0.90
10 Three Day Void 181 0.85
11 56 Calendar Days - Compliance 164 0.77
12 Damp and Mould Inspection 144 0.68
13 Emergency - Compliance - 12 Hours 107 0.50
14 Health & Safety - Compliance - 4 Hours 84 0.39
15 Discretionary 82 0.39
16 76 Calendar Days - Compliance 68 0.32
17 3 Working Days - Compliance 66 0.31
18 Emergency - 12 Calendar Hours 66 0.31
19 Section 11 Works 60 0.28
20 Urgent - Compliance - 7 Calendar Days 55 0.26
21 Appointable - 20 Working Days 33 0.16
22 Emergency Health and Safety 30 0.14
23 335 Calendar Days - Compliance 30 0.14
24 12 Calendar Hours 27 0.13
25 38 Calendar Days - Compliance 14 0.07
26 Urgent GAS - 3 Working Days 14 0.07
27 10 Working Days - Compliance 11 0.05
28 112 Calendar Days - Compliance 7 0.03
29 Major Responsive Repairs 6 0.03
30 700 Calendar Days - Compliance 1 0.00
Latest Priority DataFrame:
Latest Priority Description Count Percentage
0 Appointable 11324 53.20
1 Emergency 3641 17.11
2 Urgent PFI Evolve RD Irvine EMB 2295 10.78
3 691 3.25
4 28 Calendar Days - Compliance 678 3.19
5 Pre Inspection 5 Working Days 572 2.69
6 7 Working Days - Compliance 346 1.63
7 Two Week Void 278 1.31
8 Damp and Mould Follow-On Work 257 1.21
9 Urgent GAS Evolve RD Irvine EMB 205 0.96
10 Three Day Void 181 0.85
11 56 Calendar Days - Compliance 170 0.80
12 Emergency - Compliance - 12 Hours 134 0.63
13 Health & Safety - Compliance - 4 Hours 84 0.39
14 Discretionary 82 0.39
15 76 Calendar Days - Compliance 68 0.32
16 3 Working Days - Compliance 66 0.31
17 Section 11 Works 60 0.28
18 Urgent - Compliance - 7 Calendar Days 55 0.26
19 Emergency Health and Safety 30 0.14
20 335 Calendar Days - Compliance 30 0.14
21 38 Calendar Days - Compliance 14 0.07
22 10 Working Days - Compliance 11 0.05
23 112 Calendar Days - Compliance 7 0.03
24 Major Responsive Repairs 6 0.03
25 700 Calendar Days - Compliance 1 0.00
Job Status DataFrame:
JOB_STATUS_DESCRIPTION Count Percentage
0 Invoice Accepted 16086 75.57
1 Abandoned 4033 18.95
2 Job Logged 741 3.48
3 Work Completed 377 1.77
4 Pre-Inspection 41 0.19
5 Note Job 8 0.04
Trade DataFrame:
TRADE_DESCRIPTION Count Percentage
0 Gas Repairs 4442 20.87
1 Carpenter 3471 16.31
2 Plumbing 3296 15.48
3 Electrician 2178 10.23
4 Fencing 928 4.36
5 Floor Wall Ceilings 830 3.90
6 Asbestos 605 2.84
7 Pound Jobs No SOR 568 2.67
8 Miscellaneous Works 504 2.37
9 Roofing 492 2.31
10 Brickwork/Blockwork 460 2.16
11 Painting and Decorating 435 2.04
12 Out of Hours Work 428 2.01
13 Fire 415 1.95
14 Drainage Works 373 1.75
15 Groundwork 292 1.37
16 Lifts 238 1.12
17 Glazing 211 0.99
18 198 0.93
19 Multi Trade 198 0.93
20 Water 191 0.90
21 Void Repairs 190 0.89
22 Rechargeable 66 0.31
23 Scaffold 63 0.30
24 Door Access Control 59 0.28
25 Mechanical Services 50 0.23
26 Disabled Adaptations 43 0.20
27 Concrete External Works 26 0.12
28 Warden Call 23 0.11
29 Inspection 8 0.04
30 Play and Recreation 5 0.02
Abandon Reason DataFrame:
ABANDON_REASON_DESC Count Percentage
0 No Work Required 878 21.77
1 Alternative Job 825 20.46
2 No Access 680 16.86
3 Duplicate Order 405 10.04
4 Tenant Missed Appt 358 8.88
5 Wrong Contractor 223 5.53
6 Input Error 173 4.29
7 Tenant Refusal 152 3.77
8 Inspection Not Required 101 2.50
9 Added to Planned Programme 92 2.28
10 Abortive Call 88 2.18
11 Data Clean Up 23 0.57
12 Riverside Not Approved 11 0.27
13 See Repair Memo 9 0.22
14 Testing 6 0.15
15 Work Under Guarantee 3 0.07
16 No Charge 3 0.07
17 Work Deferred 2 0.05
18 Contractor Link Reason 1 0.02
Univariate Analysis (On Individual Features)
1- Among property types, jobs have been attended to in the following decreasing order of precedence: "Terrace" (34.58%), "End Terrace" (25.3%), and "Access direct" (15.66%), with "Detached" and "Other Non-Rentable Space" attended to least.
2- Among job types, "Responsive Repairs" is the most prevalent (63.29%), followed by "Gas Responsive Repairs" (19.63%), with "Lightning Conductors and Fall Safety Rep" having the lowest prevalence.
3- Among initial priorities at the time of incident logging, "Appointable" is at the top (48.42% of non-blank entries), followed by Emergency (20.63%) and Urgent PFI Evolve RD Irvine EMB (11.34%), with all other initial priorities below 4%.
4- Similarly, among latest priorities at the time of resolution, "Appointable" is at the top (53.2%), followed by Emergency (17.11%) and Urgent PFI Evolve RD Irvine EMB (10.78%), with all other latest priorities below 4%.
5- Among job statuses, "Invoice Accepted" is the most prevalent at 75.57%, followed by "Abandoned" (18.95%), while all others are below 4%, with "Note Job" the lowest at 0.04%.
6- Among the trades (i.e. repair tasks) carried out at the property, "Gas Repairs" is the most common at 21.06% of records with a trade recorded, with "Carpenter", "Plumbing", and "Electrician" not far behind at 16.46%, 15.63%, and 10.33% respectively, while the remaining trades are much less common (below 5%), with "Play and Recreation" the least prevalent.
7- "No Work Required" is the most common abandonment reason (21.77%), followed closely by "Alternative Job" (20.46%), "No Access" (16.86%), and "Duplicate Order" (10.04%); interestingly, a substantial share of raised complaints are found on inspection to need no repair at all.
8- Job priority status gradually shifts towards "Appointable" between logging and resolution, plausibly because properties that were initially non-appointable come under the appointment radar once occupants who were absent at logging time become available.
# List of columns for which you want to get unique values count
columns_to_check = [
'JOB_TYPE_DESCRIPTION',
'CONTRACTOR',
'Jobsourcedescription',
'Property Type',
'Initial Priority Description',
'Mgt Area',
'TRADE_DESCRIPTION',
'ABANDON_REASON_DESC',
'JOB_STATUS_DESCRIPTION',
'Latest Priority Description'
]
# Get unique values count for each specified column
unique_values_counts = {col: len(Int_df_merged[col].unique()) for col in columns_to_check}
# Display the result
for col, count in unique_values_counts.items():
print(f'{col}: {count} unique values')
JOB_TYPE_DESCRIPTION: 44 unique values CONTRACTOR: 33 unique values Jobsourcedescription: 15 unique values Property Type: 10 unique values Initial Priority Description: 31 unique values Mgt Area: 3 unique values TRADE_DESCRIPTION: 31 unique values ABANDON_REASON_DESC: 20 unique values JOB_STATUS_DESCRIPTION: 6 unique values Latest Priority Description: 26 unique values
Analysis Objective - Profile of repairs in Borsetshire county
Potential combinations of variables for cross-tabulation plots to gain insights into repair patterns:
Job Type vs. Initial Priority Description:
Cross-tabulate the types of repair jobs with their initial priority descriptions to understand the distribution of priority levels for each job type.
Property Type vs. Initial Priority Description:
Examine how the initial priority of repairs varies across different property types.
Property Type vs. Final Priority Description:
Examine how the final priority of repairs varies across different property types.
Property Type vs. Job Status Description:
Analyze the relationship between the property type and the status of repair jobs.
Property Type vs. Trade Description:
Investigate whether certain types of properties are linked to particular repair trades.
Property Type vs. Abandon Reason Description:
Investigate whether any property types are more prone to abandonment for particular reasons.
Trade Description vs. Abandon Reason Description:
Examine the reasons for abandoning repair jobs within specific trade categories.
Property Type vs. Contractor:
Investigate the distribution of contractors across different property types.
Mgt Area vs. Trade Description:
Explore the distribution of repair trades across different management areas.
Year of Build Date vs. Total Value:
Investigate how the total repair value is distributed across the years of property build.
Month vs. Total Value:
Understand the monthly distribution of total repair values.
Initial Priority Description vs. Job Status Description:
Explore the relationship between the initial priority of repairs and their current status.
Year vs. Total Value:
Analyze how the total value of repairs varies across different years.
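Each pairing above can be tabulated with `pd.crosstab`; a minimal sketch on a small hypothetical frame (the column names mirror the dataset, but the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy rows mirroring two of the columns listed above
df = pd.DataFrame({
    'Property Type': ['Terrace', 'Terrace', 'Flat', 'End Terrace', 'Flat'],
    'JOB_STATUS_DESCRIPTION': ['Work Completed', 'Job Logged', 'Job Logged',
                               'Work Completed', 'Work Completed'],
})

# Raw counts of each job status within each property type
ct = pd.crosstab(df['Property Type'], df['JOB_STATUS_DESCRIPTION'])

# normalize='index' gives row-wise proportions instead of raw counts
ct_pct = pd.crosstab(df['Property Type'], df['JOB_STATUS_DESCRIPTION'], normalize='index')
print(ct)
```

Any of the pairings listed above drops in by swapping the two column names, and the resulting table can be fed straight to a stacked bar chart.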
Pairwise Scatter Plot
# Pairwise Scatter Plot
numeric_columns = Int_df_merged.select_dtypes(include=['int64', 'float64','datetime64']).columns.tolist()
# Columns of interest
# Cols_of_interest = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Mgt Area', 'Year of Build Date',
# 'Jobsourcedescription', 'Property Type', 'Initial Priority Description',
# 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'JOB_STATUS_DESCRIPTION',
# 'Latest Priority Description', 'Total Value']
# Pairwise Scatter Plot
pairplot = sns.pairplot(Int_df_merged[numeric_columns], height=3)
for ax in pairplot.axes.flat:
    ax.tick_params(axis='both', which='both', labelsize=12, width=2, length=6, direction='inout')
    ax.xaxis.label.set_fontsize(14)
    ax.yaxis.label.set_fontsize(14)
    ax.xaxis.label.set_fontweight('bold')
    ax.yaxis.label.set_fontweight('bold')
pairplot.fig.suptitle('Pairwise Scatter Plots', y=1.02, fontsize=18, fontweight='bold')
plt.show()
Scatter plot of individual Categorical Predictor(s) against "Total Repair Value"
# Columns of interest
Cols_of_interest = [
'JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Mgt Area', 'Jobsourcedescription',
'Property Type', 'Initial Priority Description', 'Latest Priority Description',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'JOB_STATUS_DESCRIPTION'
]
# Response variable
response = 'Total Value'
# Create a figure with subplots
fig, axes = plt.subplots(nrows=len(Cols_of_interest), ncols=1, figsize=(10, 5 * len(Cols_of_interest)))
# Loop through each column of interest and create a plot
for i, col in enumerate(Cols_of_interest):
    sns.scatterplot(data=Int_df_merged, x=col, y=response, ax=axes[i])
    axes[i].set_title(f'Scatter Plot of {col} vs {response}', fontweight="bold", fontsize=12)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel(response)
    axes[i].tick_params(axis='x', rotation=90)  # Rotate x-axis labels
plt.tight_layout()
plt.show()
Pairplot and Scatterplot observations
Pairplot (numeric/datetime predictors with each other)
There are a few outliers in the "Total Value" variable, with most data points clustered at the lower end and a few extreme values going up to around 17,500.
The "Year of Build Date" variable has a large cluster of points at around the year 2000, with some sparse older dates stretching back to 1800.
For the variables "Year_comp_log" and "Year_comp_solved", the data points are discrete and heavily concentrated around specific values, suggesting these might represent categorical data or years with limited variability.
The "Days Taken" variable is slightly right skewed with few potential outliers.
Overall, there is no clear linear relationship between "Total Value" and "Year of Build Date".
"Days Taken" in isolation does not have much impact in total repair cost, which indicates it has
"Year of Build Date" appears to have a non-uniform distribution, with two noticeable periods of high activity or data concentrationβaround the 1900s and a significant spike around the year 2000.
The "Days Taken" distribution has a peculiar pattern, with a concentration of points at specific positive values and a less dense, but notable number of points with negative values, which is due to missing "Day Comp" values.
Scatterplot (categorical predictors with "Total Repair Value")
- JOB_TYPE_DESCRIPTION - Certain repairs like "Void Repairs", "Responsive Repairs", and "Suspected Damp" have large clusters of data points, indicating the volume of tasks undertaken.
- CONTRACTOR - "N/A" category contractors dominate in carrying out the most repair tasks.
- Mgt. Area - The "MA1" management area dominates, supervising a very large number of repair tasks.
- Jobsourcedescription - Primarily "Total Mobile App", "One Mobile App", and "CSC Phone call" are the origin of a large number of repair complaint logs.
- Property Type - Property types "Terrace", "Access Direct", and "End Terrace" have the most repair requests.
- Initial Priority Description - The initial complaint priority "Two week void" dominates the list of all initial priorities.
- Final Priority Description - Similarly, the final complaint priority "Two week void" dominates the list of all final priorities.
- TRADE_Description - The "Carpentry" trade dominates, accounting for a large number of repairs.
- ABANDON_REASON_DESC - Interestingly, all valid abandon reason codes have a Total repair value of '0', while records with no abandon reason code (blank/null) carry all the repair values in the original data, which the scatter plot reflects as a straight line at 0.
- JOB_STATUS_DESCRIPTION - "Job Logged", "Invoice Accepted", and "Work Completed" dominate the list, in that order, of statuses under which most repair tasks have been carried out.
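The ABANDON_REASON_DESC observation (every abandoned job carries a zero repair value) can also be verified directly with a groupby rather than read off the scatter plot; a sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: abandoned jobs carry a zero value, live jobs carry real costs
df = pd.DataFrame({
    'ABANDON_REASON_DESC': ['No Access', 'No Work Required', None, None],
    'Total Value': [0.0, 0.0, 120.5, 88.0],
})

# Max value per abandon reason: 0 everywhere confirms the straight line at y=0
max_by_reason = df.groupby('ABANDON_REASON_DESC')['Total Value'].max()
print(max_by_reason)
```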
# Convert Date Logged to datetime type with specified format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], format='%d/%m/%Y')
# Extract unique months from the Date Logged column
unique_months = Int_df_merged['Date Logged'].dt.to_period('M').unique()
# Display the number of unique months
num_months = len(unique_months)
print(f"There are {num_months} unique months in the DataFrame.")
# Find the earliest (lowest) and latest (highest) dates
earliest_date = Int_df_merged['Date Logged'].min()
latest_date = Int_df_merged['Date Logged'].max()
print(f"The earliest date in the DataFrame is {earliest_date.strftime('%Y-%m-%d')}.")
print(f"The latest date in the DataFrame is {latest_date.strftime('%Y-%m-%d')}.")
There are 19 unique months in the DataFrame.
The earliest date in the DataFrame is 2022-06-09.
The latest date in the DataFrame is 2023-12-08.
# # Plot using Plotly Express
# fig1 = px.scatter(numeric_repair_df, x='Year of Build Date', y='Total Value', title='Scatter Plot')
# fig1.show()
# x = numeric_repair_df['Total Value']
# hist_data = [x]
# group_labels = ['distplot of Total Value'] # col name
# colors = ['#A56CC1']
numeric_repair_df['Year of Build Date'] = pd.to_numeric(numeric_repair_df['Year of Build Date'], errors='coerce')
fig1 = px.scatter(numeric_repair_df, x='Year of Build Date', y='Total Value', color='Year of Build Date',
title='Scatter Plot', labels={'Total Value': 'Total Value'})
# Update x-axis tick mode to 'array' to display all years
# Increase the spacing between axis markers using dtick
fig1.update_xaxes(tickmode='array', tickvals=numeric_repair_df['Year of Build Date'].unique(), tickangle=45, dtick=10)
# Increase the size of the plot
fig1.update_layout(
height=600,
width=1000,
title=dict(
text='<b>Scatter Plot - Year of Build and Total Value</b>',
x=0.5, # Center the title
)
)
# Show the plot
fig1.show()
# Manually sort the DataFrame by 'Year of Build Date'
numeric_repair_df = numeric_repair_df.sort_values(by='Year of Build Date')
# Convert 'Year of Build Date' to categorical with ordered categories
sorted_years = sorted(numeric_repair_df['Year of Build Date'].unique())
numeric_repair_df['Year of Build Date'] = pd.Categorical(
numeric_repair_df['Year of Build Date'],
categories=sorted_years,
ordered=True
)
# Create scatter plot with animation frame
fig = px.scatter(
numeric_repair_df,
x='Year of Build Date',
y='Total Value',
animation_frame='Year of Build Date',
color='Total Value',
color_continuous_scale='Viridis',
labels={'Total Value': 'Total Value'},
title='<b>Scatter Plot with Color Intensity for Year Build Date vs. Total Value</b>'
)
# Customize the layout to remove x-axis decimals, add a range slider, and slow down the animation
fig.update_layout(
xaxis_title='Year of Build Date',
yaxis_title='Total Value',
coloraxis_colorbar=dict(title='Total Value'),
title_x=0.5,
xaxis=dict(
tickmode='linear',
range=[1800, 2023],
dtick=10
),
updatemenus=[{
'type': 'buttons',
'showactive': False,
'buttons': [{
'label': 'Play',
'method': 'animate',
'args': [None, {'frame': {'duration': 4000, 'redraw': True}, 'fromcurrent': True}]
}, {
'label': 'Pause',
'method': 'animate',
'args': [[None], {'frame': {'duration': 0, 'redraw': True}, 'mode': 'immediate', 'transition': {'duration': 0}}]
}]
}],
sliders=[{
'active': 0,
'steps': [{
'args': [[frame], {'frame': {'duration': 4000, 'redraw': True}, 'mode': 'immediate', 'transition': {'duration': 0}}],
'label': str(frame), # Convert frame to string
'method': 'animate',
} for frame in sorted_years]
}]
)
# Show the plot
fig.show()
Year of Build Date vs. Total Value:
Explore the relationship between the year a property was built and the total value of repair jobs. This can provide insights into whether older properties require more expensive repairs.
Investigate how the total repair value is distributed across the years of property build.
1- There is no discernible pattern of repair costs based on the age of building construction, which seems intuitive, as these are not building-fabric maintenance repairs.
2- Certain years (2007, 2015, 1990) show anomalously high (i.e. outlier) average household maintenance costs, with values of 390, 324, and 307 respectively.
Repair Complaints Logged Over the Years and Months
1- Year 2023 has more complaints than 2022, a 53.26% increase over the previous year.
2- Though not conclusive, the monthly repair patterns suggest that the mid-to-late months of the year (June to November) have more complaints on average across years, with a declining trend from December onwards.
3- Most repair complaints are logged around the middle of the month.
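The 53.26% figure is a simple percentage increase over the previous year's count; the yearly totals below are hypothetical (the real ones come from `Int_df_merged['Year_comp_log'].value_counts()`), chosen only to reproduce the arithmetic:

```python
# Hypothetical yearly complaint counts (illustration only)
yearly_counts = {2022: 4600, 2023: 7050}

def pct_increase(old, new):
    """Percentage increase of new over old, relative to old."""
    return (new - old) / old * 100

change = pct_increase(yearly_counts[2022], yearly_counts[2023])
print(f'{change:.2f}% increase')  # 53.26% increase
```

Note the denominator: an increase "over the previous year" divides by the earlier year's count, not the later one.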
# Int_df_merged.info()
# Count the number of missing values (including empty strings) in 'column1'
# missing_values_count = Int_df_merged['Date Comp'].apply(lambda x: x == '' or pd.isna(x)).sum()
# print(f"Number of missing or empty values in 'column1': {missing_values_count}")
# Int_df.info()
# Convert date columns to datetime format (dates are day-first, e.g. %d/%m/%Y;
# infer_datetime_format is deprecated in recent pandas versions)
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], dayfirst=True)
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'], dayfirst=True)
# Get the minimum date from 'Date Logged'
min_date = Int_df_merged['Date Logged'].min()
# Ensure the DataFrame only includes data starting from the minimum date
Int_df_merged = Int_df_merged[Int_df_merged['Date Logged'] >= min_date]
# Resample and count complaints logged per month
complaints_logged_per_month = Int_df_merged.resample('M', on='Date Logged').size().reset_index(name='Logged_Count')
complaints_logged_per_month['YearMonth'] = complaints_logged_per_month['Date Logged'].dt.strftime('%Y-%b')
# Resample and count complaints solved per month
complaints_solved_per_month = Int_df_merged.resample('M', on='Date Comp').size().reset_index(name='Solved_Count')
complaints_solved_per_month['YearMonth'] = complaints_solved_per_month['Date Comp'].dt.strftime('%Y-%b')
# Create the plot
fig = go.Figure()
# Add traces for logged complaints
fig.add_trace(go.Scatter(x=complaints_logged_per_month['Date Logged'],
y=complaints_logged_per_month['Logged_Count'],
mode='lines+markers', name='Logged Complaints',
hovertemplate='<b>Year-Month: %{text}<br>Logged: %{y}</b>',
text=complaints_logged_per_month['YearMonth']))
# Add traces for solved complaints
fig.add_trace(go.Scatter(x=complaints_solved_per_month['Date Comp'],
y=complaints_solved_per_month['Solved_Count'],
mode='lines+markers', name='Solved Complaints',
hovertemplate='<b>Year-Month: %{text}<br>Solved: %{y}</b>',
text=complaints_solved_per_month['YearMonth']))
# Update layout
fig.update_layout(title_text='<b>Repair Complaints Logged vs Solved</b>',
xaxis_title='<b>Date</b>', yaxis_title='<b>Number of Complaints</b>',
title_x=0.5, showlegend=True)
# Show the plot
fig.show()
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'])
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'])
# Extract relevant temporal features
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Year_comp_solved'] = Int_df_merged['Date Comp'].dt.year
# Count the number of complaints per year and date
complaints_per_date = Int_df_merged.groupby(['Year_comp_log', 'Date Logged']).size().reset_index(name='Count')
complaints_solved_per_date = Int_df_merged.groupby(['Year_comp_solved', 'Date Comp']).size().reset_index(name='Count')
# Create traces for logged complaints
traces_logged = [go.Scatter(x=df_year['Date Logged'], y=df_year['Count'],
mode='lines+markers', name=f'Logged {year}',
hovertemplate='%{y} Complaints<br>Year: %{text}',
text=df_year['Year_comp_log'])
for year, df_year in complaints_per_date.groupby('Year_comp_log')]
# Create traces for solved complaints
traces_solved = [go.Scatter(x=df_year['Date Comp'], y=df_year['Count'],
mode='lines+markers', name=f'Solved {year}',
hovertemplate='%{y} Complaints Solved<br>Year: %{text}',
text=df_year['Year_comp_solved'])
for year, df_year in complaints_solved_per_date.groupby('Year_comp_solved')]
# Create figures for both logged and solved complaints
fig_logged = go.Figure(data=traces_logged, layout=go.Layout(title='<b>Number of Repair Complaints Logged Over Years</b>',
showlegend=True, title_x=0.5,
xaxis=dict(title='Date'),
yaxis=dict(title='Number of Complaints')))
fig_solved = go.Figure(data=traces_solved, layout=go.Layout(title='<b>Number of Repair Complaints Solved Over Years</b>',
showlegend=True, title_x=0.5,
xaxis=dict(title='Date'),
yaxis=dict(title='Number of Complaints Solved')))
# Show the plots
fig_logged.show()
fig_solved.show()
# Convert date columns to datetime format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'])
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'])
# Extract relevant temporal features
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Year_comp_solved'] = Int_df_merged['Date Comp'].dt.year
# Count the number of complaints logged per year and date
complaints_logged_per_date = Int_df_merged.groupby(['Year_comp_log', 'Date Logged']).size().reset_index(name='Logged_Count')
# Count the number of complaints solved per year and date
complaints_solved_per_date = Int_df_merged.groupby(['Year_comp_solved', 'Date Comp']).size().reset_index(name='Solved_Count')
# Create traces for logged complaints
logged_traces = []
for year in sorted(complaints_logged_per_date['Year_comp_log'].unique()):
    df_year = complaints_logged_per_date[complaints_logged_per_date['Year_comp_log'] == year]
    trace = go.Scatter(x=df_year['Date Logged'], y=df_year['Logged_Count'],
                       mode='lines+markers', name=f'Logged {year}',
                       hovertemplate='%{y} Complaints Logged<br>Year: %{text}',
                       text=df_year['Year_comp_log'])
    logged_traces.append(trace)
# Create traces for solved complaints
solved_traces = []
for year in sorted(complaints_solved_per_date['Year_comp_solved'].unique()):
    df_year = complaints_solved_per_date[complaints_solved_per_date['Year_comp_solved'] == year]
    trace = go.Scatter(x=df_year['Date Comp'], y=df_year['Solved_Count'],
                       mode='lines+markers', name=f'Solved {year}',
                       hovertemplate='%{y} Complaints Solved<br>Year: %{text}',
                       text=df_year['Year_comp_solved'])
    solved_traces.append(trace)
# Combine all traces
all_traces = logged_traces + solved_traces
# Create layout
layout = go.Layout(title='<b>Number of Repair Complaints Logged vs Solved Over Years</b>', showlegend=True, title_x=0.5,
xaxis=dict(title='Date'), yaxis=dict(title='Number of Complaints'))
# Create figure
fig = go.Figure(data=all_traces, layout=layout)
# Show the plot
fig.show()
# Convert date columns to datetime format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
# Extract relevant temporal features
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Month_comp_log'] = Int_df_merged['Date Logged'].dt.month_name()
# Count the number of complaints per year and month
complaints_per_month = Int_df_merged.groupby(['Year_comp_log', 'Month_comp_log']).size().reset_index(name='Count')
# Ensure proper sorting of months
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
complaints_per_month['Month_comp_log'] = pd.Categorical(complaints_per_month['Month_comp_log'], categories=month_order, ordered=True)
complaints_per_month = complaints_per_month.sort_values(['Year_comp_log', 'Month_comp_log'])
# Create traces for each year
traces = []
for year in sorted(complaints_per_month['Year_comp_log'].unique()):
    df_year = complaints_per_month[complaints_per_month['Year_comp_log'] == year]
    trace = go.Scatter(x=df_year['Month_comp_log'], y=df_year['Count'],
                       mode='lines+markers', name=f'Year {year}',
                       hovertemplate='%{y} Complaints<br>Month: %{x}',
                       text=str(year))
    traces.append(trace)
# Create layout with ordered month labels
layout = go.Layout(title='<b>Number of Repair Complaints Logged Over Years and All Months</b>', showlegend=True,
xaxis=dict(title='<b>Month</b>', categoryorder='array', categoryarray=month_order),
yaxis=dict(title='<b>Number of Complaints</b>'), title_x=0.5)
# Create figure
fig = go.Figure(data=traces, layout=layout)
# Show the plot
fig.show()
# job_id_to_filter = 1523686
# # Filter the DataFrame for the specific Job Id
# filtered_df =Int_df_merged[Int_df_merged['Job No'] == job_id_to_filter]
# # Display the filtered DataFrame
# print(filtered_df)
# # Check the data type of 'Job No'
# print("Data type of 'Job No':", Int_df_merged['Job No'].dtype)
# # Check for unique values (or a sample of them) in 'Job No'
# print("Unique values in 'Job No':", Int_df_merged['Job No'].unique()[:10])
# # Ensure job_id_to_filter is the same data type as 'Job No'
# job_id_to_filter = 1523686 # or '1523686' if 'Job No' is a string
# print("Data type of job_id_to_filter:", type(job_id_to_filter))
# # Filter the DataFrame
# filtered_df = Int_df_merged[Int_df_merged['Job No'] == job_id_to_filter]
# # Check the filtered DataFrame
# print(filtered_df)
# Adjust the filter to include rows where 'Date Comp' is null or greater than or equal to min_date,
# and 'Date Logged' is greater than or equal to min_date
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'], errors='coerce')
# Find the earliest dates
min_logged_date = Int_df_merged['Date Logged'].min()
min_comp_date = Int_df_merged['Date Comp'].min()
min_date = min(min_logged_date, min_comp_date)
# Filter data from the earliest date onwards
# Int_df_merged = Int_df_merged[(Int_df_merged['Date Logged'] >= min_date) & (Int_df_merged['Date Comp'] >= min_date)]
# Adjusted filter to keep rows where 'Date Comp' is null
# Int_df_merged = Int_df_merged[(Int_df_merged['Date Logged'] >= min_date) | pd.isna(Int_df_merged['Date Comp'])]
Int_df_merged = Int_df_merged[((Int_df_merged['Date Comp'] >= min_date) | pd.isna(Int_df_merged['Date Comp'])) & (Int_df_merged['Date Logged'] >= min_date)]
# Extract relevant temporal features for logged complaints
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Month_comp_log'] = Int_df_merged['Date Logged'].dt.month_name()
# Extract relevant temporal features for solved complaints
Int_df_merged['Year_comp_solved'] = Int_df_merged['Date Comp'].dt.year
Int_df_merged['Month_comp_solved'] = Int_df_merged['Date Comp'].dt.month_name()
# Group and count complaints by year and month
# (Logged complaints)
complaints_logged_per_month = Int_df_merged.groupby(['Year_comp_log', 'Month_comp_log']).size().reset_index(name='Logged_Count')
# (Solved complaints)
complaints_solved_per_month = Int_df_merged.groupby(['Year_comp_solved', 'Month_comp_solved']).size().reset_index(name='Solved_Count')
# Ensure proper sorting of months
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
complaints_logged_per_month['Month_comp_log'] = pd.Categorical(complaints_logged_per_month['Month_comp_log'], categories=month_order, ordered=True)
complaints_logged_per_month = complaints_logged_per_month.sort_values(['Year_comp_log', 'Month_comp_log'])
complaints_solved_per_month['Month_comp_solved'] = pd.Categorical(complaints_solved_per_month['Month_comp_solved'], categories=month_order, ordered=True)
complaints_solved_per_month = complaints_solved_per_month.sort_values(['Year_comp_solved', 'Month_comp_solved'])
# Convert to integer
Int_df_merged['Year_comp_log'] = Int_df_merged['Year_comp_log'].astype(int)
# Replace NaN values with a placeholder year (e.g., -1)
Int_df_merged['Year_comp_solved'] = Int_df_merged['Year_comp_solved'].fillna(-1)
# Convert to integer
Int_df_merged['Year_comp_solved'] = Int_df_merged['Year_comp_solved'].astype(int)
# Create figure
fig = go.Figure()
# # Add traces for logged complaints
# for year in sorted(complaints_logged_per_month['Year_comp_log'].unique()):
# df_year = complaints_logged_per_month[complaints_logged_per_month['Year_comp_log'] == year]
# fig.add_trace(go.Scatter(x=df_year['Month_comp_log'], y=df_year['Logged_Count'],
# mode='lines+markers', name=f'Logged {year}',
# hovertemplate='%{y} Complaints Logged<br>Month: %{x}'))
# # Add traces for solved complaints
# for year in sorted(complaints_solved_per_month['Year_comp_solved'].unique()):
# df_year = complaints_solved_per_month[complaints_solved_per_month['Year_comp_solved'] == year]
# fig.add_trace(go.Scatter(x=df_year['Month_comp_solved'], y=df_year['Solved_Count'],
# mode='lines+markers', name=f'Solved {year}',
# hovertemplate='%{y} Complaints Solved<br>Month: %{x}'))
# Add traces for logged complaints
for year in sorted(complaints_logged_per_month['Year_comp_log'].unique()):
    df_year = complaints_logged_per_month[complaints_logged_per_month['Year_comp_log'] == year]
    year_str = str(int(year))  # Convert year to string
    fig.add_trace(go.Scatter(x=df_year['Month_comp_log'], y=df_year['Logged_Count'],
                             mode='lines+markers', name=f'Logged {year_str}',
                             hovertemplate='%{y} Complaints Logged<br>Month: %{x}'))
# Add traces for solved complaints
for year in sorted(complaints_solved_per_month['Year_comp_solved'].unique()):
    df_year = complaints_solved_per_month[complaints_solved_per_month['Year_comp_solved'] == year]
    year_str = str(int(year))  # Convert year to string
    fig.add_trace(go.Scatter(x=df_year['Month_comp_solved'], y=df_year['Solved_Count'],
                             mode='lines+markers', name=f'Solved {year_str}',
                             hovertemplate='%{y} Complaints Solved<br>Month: %{x}'))
# Update layout
fig.update_layout(title='<b>Avg. Number of Repair Complaints Logged vs Solved Over Years and All Months</b>', showlegend=True,
xaxis=dict(title='<b>Month</b>', categoryorder='array', categoryarray=month_order),
yaxis=dict(title='<b>Number of Complaints</b>'), title_x=0.5)
# Show the plot
fig.show()
# missing_values_count = Int_df_merged['Date Comp'].apply(lambda x: x == '' or pd.isna(x)).sum()
# print(f"Number of missing or empty values in 'column1': {missing_values_count}")
# Int_df_merged.info()
# Convert date columns to datetime format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
# Drop rows where 'Date Logged' is NaT after conversion
Int_df_merged = Int_df_merged.dropna(subset=['Date Logged'])
# Extract relevant temporal features
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Day of Date Logged'] = Int_df_merged['Date Logged'].dt.day_name()
# Count the number of complaints per year and day
complaints_per_day = Int_df_merged.groupby(['Year_comp_log', 'Day of Date Logged']).size().reset_index(name='Count')
# Ensure proper sorting of days
day_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
complaints_per_day['Day of Date Logged'] = pd.Categorical(complaints_per_day['Day of Date Logged'], categories=day_order, ordered=True)
complaints_per_day = complaints_per_day.sort_values(['Year_comp_log', 'Day of Date Logged'])
# Create traces for each year
traces = []
for year in sorted(complaints_per_day['Year_comp_log'].unique()):
    df_year = complaints_per_day[complaints_per_day['Year_comp_log'] == year]
    trace = go.Scatter(x=df_year['Day of Date Logged'], y=df_year['Count'],
                       mode='lines+markers', name=f'Year {year}',
                       hovertemplate='%{y} Complaints<br>Day: %{x}',
                       text=str(year))
    traces.append(trace)
# Create layout with ordered day labels
layout = go.Layout(title='<b>Number of Repair Complaints Logged Over Years and WeekDays</b>', showlegend=True,
xaxis=dict(title='<b>Day</b>', categoryorder='array', categoryarray=day_order),
yaxis=dict(title='<b>Number of Complaints</b>'), title_x=0.5)
# Create figure
fig = go.Figure(data=traces, layout=layout)
# Show the plot
fig.show()
# Convert date columns to datetime format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'], errors='coerce')
# Extract relevant temporal features for logged complaints
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Day of Date Logged'] = Int_df_merged['Date Logged'].dt.day_name()
# Extract relevant temporal features for solved complaints
Int_df_merged['Year_comp_solved'] = Int_df_merged['Date Comp'].dt.year
Int_df_merged['Day of Date Comp'] = Int_df_merged['Date Comp'].dt.day_name()
# Count the number of complaints per year and day for logged complaints
complaints_logged_per_day = Int_df_merged.groupby(['Year_comp_log', 'Day of Date Logged']).size().reset_index(name='Logged_Count')
# Count the number of complaints per year and day for solved complaints
complaints_solved_per_day = Int_df_merged.groupby(['Year_comp_solved', 'Day of Date Comp']).size().reset_index(name='Solved_Count')
# Ensure proper sorting of days
day_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
complaints_logged_per_day['Day of Date Logged'] = pd.Categorical(complaints_logged_per_day['Day of Date Logged'], categories=day_order, ordered=True)
complaints_logged_per_day = complaints_logged_per_day.sort_values(['Year_comp_log', 'Day of Date Logged'])
complaints_solved_per_day['Day of Date Comp'] = pd.Categorical(complaints_solved_per_day['Day of Date Comp'], categories=day_order, ordered=True)
complaints_solved_per_day = complaints_solved_per_day.sort_values(['Year_comp_solved', 'Day of Date Comp'])
# Convert year columns to integers in the grouped DataFrames
complaints_solved_per_day['Year_comp_solved'] = complaints_solved_per_day['Year_comp_solved'].astype(int)
# Create figure
fig = go.Figure()
# Add traces for logged complaints
for year in sorted(complaints_logged_per_day['Year_comp_log'].unique()):
    df_year = complaints_logged_per_day[complaints_logged_per_day['Year_comp_log'] == year]
    fig.add_trace(go.Scatter(x=df_year['Day of Date Logged'], y=df_year['Logged_Count'],
                             mode='lines+markers', name=f'Logged {year}',
                             hovertemplate='%{y} Complaints Logged<br>Day: %{x}'))
# Add traces for solved complaints
for year in sorted(complaints_solved_per_day['Year_comp_solved'].unique()):
    df_year = complaints_solved_per_day[complaints_solved_per_day['Year_comp_solved'] == year]
    fig.add_trace(go.Scatter(x=df_year['Day of Date Comp'], y=df_year['Solved_Count'],
                             mode='lines+markers', name=f'Solved {year}',
                             hovertemplate='%{y} Complaints Solved<br>Day: %{x}'))
# Update layout
fig.update_layout(title='<b>Avg. No. of Repair Complaints Logged vs Solved Over Years and WeekDays</b>', showlegend=True,
xaxis=dict(title='<b>Day</b>', categoryorder='array', categoryarray=day_order),
yaxis=dict(title='<b>Number of Complaints</b>'), title_x=0.5)
# Show the plot
fig.show()
# Calculate the mean total repair value for each year of build
mean_values = Int_df_merged.groupby('Year of Build Date')['Total Value'].mean().reset_index()
# Set the size of the plot
plt.figure(figsize=(14, 8))
# Create a clustered bar plot using Seaborn
ax = sns.barplot(
x='Year of Build Date',
y='Total Value',
data=Int_df_merged,
estimator=np.mean, # This specifies the estimator (mean in this case)
errorbar=None # This removes confidence intervals
)
# Add rounded mean values above the bars
for p, mean_value in zip(ax.patches, mean_values['Total Value']):
    ax.annotate(f'{int(round(mean_value))}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points', fontsize=10)
# Add title and labels with bold font
plt.title('Year of Build Date vs. Mean Total Value - Clustered Bar Plot', fontweight='bold')
# Make x-axis and y-axis labels bold
plt.xlabel('Year of Build Date', fontweight='bold')
plt.ylabel('Mean Total Value', fontweight='bold')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Show the plot
plt.show()
# Convert date columns to datetime format with appropriate date formats
date_columns = {
'Year of Build Date': '%Y', # Assuming this column only contains year information
'Date Logged': '%d/%m/%Y', # Assuming this column contains full date information
'Date Comp': '%d/%m/%Y' # Assuming this column contains full date information
}
for col, date_format in date_columns.items():
    Int_df_merged[col] = pd.to_datetime(Int_df_merged[col], format=date_format, errors='coerce')
# Extract relevant temporal features with new variable names
Int_df_merged['Year_comp_log'] = Int_df_merged['Date Logged'].dt.year
Int_df_merged['Month_comp_log'] = Int_df_merged['Date Logged'].dt.month
Int_df_merged['Day_comp_log'] = Int_df_merged['Date Logged'].dt.day
# Plotting temporal trends
fig, axes = plt.subplots(3, 1, figsize=(12, 18))
# Plotting repair complaints over years
ax1 = sns.countplot(x='Year_comp_log', data=Int_df_merged, ax=axes[0])
axes[0].set_title('Number of Repair Complaints Over Years')
axes[0].set_xlabel('Year_comp_log')
axes[0].set_ylabel('Number of Complaints')
# Annotate bars with count values (remove decimal values)
for p in ax1.patches:
    ax1.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Calculate % diff in yearly values
yearly_counts = Int_df_merged['Year_comp_log'].value_counts()
percentage_diff = ((yearly_counts.max() - yearly_counts.min()) / yearly_counts.max()) * 100
# Display message at the top of the plot
fig.suptitle(f'Percentage Difference in Yearly Values: {percentage_diff:.2f}%')
# Plotting repair complaints over months
ax2 = sns.countplot(x='Month_comp_log', data=Int_df_merged, ax=axes[1])
axes[1].set_title('Number of Repair Complaints Over Months')
axes[1].set_xlabel('Month_comp_log')
axes[1].set_ylabel('Number of Complaints')
# Annotate bars with count values (remove decimal values)
for p in ax2.patches:
ax2.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Plotting repair complaints over days
ax3 = sns.countplot(x='Day_comp_log', data=Int_df_merged, ax=axes[2])
axes[2].set_title('Number of Repair Complaints Over Days of the Month')
axes[2].set_xlabel('Day_comp_log')
axes[2].set_ylabel('Number of Complaints')
# Annotate bars with count values (remove decimal values)
for p in ax3.patches:
ax3.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # Adjust layout to make room for the suptitle
plt.show()
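Since `errors='coerce'` silently converts unparseable dates to `NaT`, it is worth checking how many values failed to parse before relying on the extracted year/month/day features. A minimal sketch on a toy frame standing in for `Int_df_merged` (the data here is made up; the column name and format mirror the notebook's):

```python
import pandas as pd

# Toy stand-in for Int_df_merged, with one deliberately malformed date
df = pd.DataFrame({'Date Logged': ['01/02/2021', '15/07/2022', 'not a date']})
df['Date Logged'] = pd.to_datetime(df['Date Logged'], format='%d/%m/%Y', errors='coerce')

# Count values that failed to parse (coerced to NaT)
n_failed = df['Date Logged'].isna().sum()
print(n_failed)  # -> 1
```

If `n_failed` is non-trivial, those rows would be silently dropped from the count plots above, so the check is cheap insurance.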
Bivariate Analysis - Dash App - User customised selection of featuresΒΆ
This is a Dash interactive app. Please note that for the interaction to work, this notebook needs to be hosted as a standalone web application on a service provider's platform.
Specify the fields you want to include in the dropdownsΒΆ
dropdown_fields = [ 'JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Jobsourcedescription', 'Property Type', 'Initial Priority Description', 'Mgt Area', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'JOB_STATUS_DESCRIPTION', 'Latest Priority Description' ]
Initialize the Dash appΒΆ
app = dash.Dash(__name__)
Define the layout of the appΒΆ
app.layout = html.Div([ html.H1("Heatmap Dashboard"),
# Dropdown for y-axis field
dcc.Dropdown(
id='y-axis-dropdown',
options=[{'label': col, 'value': col} for col in dropdown_fields],
value=dropdown_fields[0],
multi=False,
style={'width': '50%'}
),
# Dropdown for x-axis field
dcc.Dropdown(
id='x-axis-dropdown',
options=[{'label': col, 'value': col} for col in dropdown_fields],
value=dropdown_fields[1],
multi=False,
style={'width': '50%'}
),
# Placeholder for the selected fields
html.Div(id='selected-fields'),
# Heatmap plot based on selected fields
dcc.Graph(id='heatmap-plot')
])
Define callback to update selected fields textΒΆ
@app.callback(
    Output('selected-fields', 'children'),
    [Input('y-axis-dropdown', 'value'), Input('x-axis-dropdown', 'value')]
)
def update_selected_fields(y_axis_field, x_axis_field):
    # Warn if the same field is selected for both axes
    if y_axis_field == x_axis_field:
        return "Please select different fields."
    else:
        return f"Selected Fields: {y_axis_field}, {x_axis_field}"
Define callback to update heatmap plot based on selected fieldsΒΆ
@app.callback(
    Output('heatmap-plot', 'figure'),
    [Input('y-axis-dropdown', 'value'), Input('x-axis-dropdown', 'value')]
)
def update_heatmap_plot(y_axis_field, x_axis_field):
    # Calculate counts and row-wise percentages
    ct = pd.crosstab(Int_df_merged[y_axis_field], Int_df_merged[x_axis_field])
    percentages = ct.div(ct.sum(axis=1), axis=0) * 100
# Create a heatmap plot based on the selected fields
fig = px.imshow(
ct,
text_auto=".2f",
color_continuous_scale='Blues',
aspect="auto"
)
# Customize hover text
hover_template = (
f"<b>{y_axis_field}:</b> %{{y}}<br>"
f"<b>{x_axis_field}:</b> %{{x}}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{text:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
text=percentages.values,
textfont_size=8
)
# Set axis labels
fig.update_xaxes(title_text=x_axis_field, title_font=dict(size=18, family='Arial', color='black'))
fig.update_yaxes(title_text=y_axis_field, title_font=dict(size=18, family='Arial', color='black'))
# Update layout for overall size
fig.update_layout(
height=800, # Adjust the height as needed
width=1200, # Adjust the width as needed
)
return fig
if __name__ == '__main__':
    app.run_server(debug=True, port=8050)
Stacked Bar Chart- Dash Interactive App - Comparison of different Categories within GroupsΒΆ
This is a Dash interactive app. Please note that for the interaction to work, it needs to be hosted as a standalone web application on a service provider's platform.
# List of categorical columns for visualization
categorical_columns = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION']
# Initialize the Dash app
app = dash.Dash(__name__)
# Layout of the app
app.layout = html.Div([
html.Label("Select a categorical column for X-axis:"),
dcc.Dropdown(
id='x-axis-dropdown',
options=[{'label': col, 'value': col} for col in categorical_columns],
value=categorical_columns[0], # Initial selected column
),
html.Label("Select a categorical column for Legend (stacks):"),
dcc.Dropdown(
id='legend-dropdown',
options=[{'label': col, 'value': col} for col in categorical_columns],
value=categorical_columns[1], # Initial selected column
),
dcc.Graph(id='stacked-bar-chart')
])
# Callback to update the stacked bar chart based on the selected dropdowns
@app.callback(
Output('stacked-bar-chart', 'figure'),
[Input('x-axis-dropdown', 'value'),
Input('legend-dropdown', 'value')]
)
def update_stacked_bar_chart(x_axis_column, legend_column):
# Check if both dropdowns have the same value
if x_axis_column == legend_column:
return {
'data': [],
'layout': {
'annotations': [{
'text': 'Please select different values for X-axis and Legend.',
'showarrow': False,
'x': 0.5,
'y': 0.5
}]
}
}
# Calculate counts
counts = Int_df_merged.groupby([x_axis_column, legend_column]).size().unstack()
# Create stacked bar chart with counts displayed on top
fig = go.Figure()
for col in counts.columns:
fig.add_trace(go.Bar(
x=counts.index,
y=counts[col],
text=counts[col].values, # Display count values on top of bars
name=col,
hovertemplate=f'<b>{x_axis_column}:</b> %{{x}}<br>'
f'<b>{legend_column}:</b> {col}<br>'
f'<b>Count:</b> %{{text}}<extra></extra>'
))
# Update layout for aesthetics
fig.update_layout(
height=600,
width=1000,
xaxis=dict(
title=f'<b>{x_axis_column}</b>',
tickmode='array',
tickvals=list(counts.index),
ticktext=list(counts.index)
),
yaxis=dict(title='<b>Count</b>', tickfont=dict(size=14, family='Arial')),
title_text=f'<b>Stacked Bar Chart for {x_axis_column} and {legend_column}</b>',
title_x=0.5, # Centered title
barmode='stack'
)
return fig
# Run the app
if __name__ == '__main__':
app.run_server(debug=True, port=8051)
Job Type vs. Initial Priority Description:ΒΆ
Cross-tabulate the types of repair jobs with their initial priority descriptions to understand the distribution of priority levels for each job type.
1- We can see that the Job Types ("Responsive Repairs" and "Gas Responsive Repairs") with "Appointable" priority dominate, followed by the same job types with "Emergency" priority.
2- These are followed by "Communal Responsive Repairs" under "Appointable" priority, though in much smaller numbers (147).
ct_job_type_priority = pd.crosstab(Int_df_merged['JOB_TYPE_DESCRIPTION'], Int_df_merged['Initial Priority Description'])
# Calculate percentages
ct_percentages = ct_job_type_priority.div(ct_job_type_priority.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_job_type_priority, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Job Type Description:</b> %{y}<br>"
"<b>Initial Priority:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Job Type Description vs. Initial Priority</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
uirevision="same",
)
# Set the category order for both x-axis and y-axis
x_axis_order = ct_job_type_priority.columns.tolist()
y_axis_order = ct_job_type_priority.index.tolist()
fig.update_layout(
xaxis=dict(categoryorder='array', categoryarray=x_axis_order),
yaxis=dict(categoryorder='array', categoryarray=y_axis_order)
)
# Make x-axis and y-axis labels bold using HTML tags
fig.update_layout(
yaxis_title="<b>Job Type Description</b>",
xaxis_title="<b>Initial Priority</b>",
)
fig.show()
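The row percentages computed above with `.div(ct.sum(axis=1), axis=0)` can also be obtained directly from `pd.crosstab` via its `normalize='index'` parameter. A minimal sketch on made-up data (the column names mirror the notebook's, but the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'JOB_TYPE_DESCRIPTION': ['Responsive', 'Responsive', 'Gas', 'Gas'],
    'Initial Priority Description': ['Appointable', 'Emergency', 'Appointable', 'Appointable'],
})

# normalize='index' yields row-wise proportions; multiply by 100 for percentages
pct = pd.crosstab(df['JOB_TYPE_DESCRIPTION'],
                  df['Initial Priority Description'],
                  normalize='index') * 100
print(pct.loc['Gas', 'Appointable'])  # -> 100.0
```

This is equivalent to the manual division used throughout these cells, and avoids keeping a separate `ct_percentages` frame in sync with the counts.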
Job Type vs. Latest Priority Description:ΒΆ
Cross-tabulate the types of repair jobs with their latest priority descriptions to understand the distribution of priority levels for each job type.
1- The latest priority of these job types stays the same as their initial priority, though the counts tend to increase over the time span, as expected, as new jobs get added.
ct_job_type_lpriority = pd.crosstab(Int_df_merged['JOB_TYPE_DESCRIPTION'], Int_df_merged['Latest Priority Description'])
# Calculate percentages
ct_percentages = ct_job_type_lpriority.div(ct_job_type_lpriority.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_job_type_lpriority, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Job Type:</b> %{y}<br>"
"<b>Latest Priority:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=7
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Job Type vs. Latest Priority</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis and y-axis labels bold using HTML tags
fig.update_layout(
xaxis_title="<b>Latest Priority</b>",
yaxis_title="<b>Job Type Description</b>",
)
fig.show()
Property Type vs. Initial Priority Description:ΒΆ
Examine how the initial priority of repairs varies across different property types.
1- We can see that the "Terrace", "End Terrace", and "Access Direct" properties with "Appointable" and "Emergency" repair priorities dominate.
2- We also see that for many of these property types the priority has not been updated, as the values are missing in the data.
ct_prop_type_priority = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['Initial Priority Description'])
# Calculate percentages
ct_percentages = ct_prop_type_priority.div(ct_prop_type_priority.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_prop_type_priority, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Property Type:</b> %{y}<br>"
"<b>Initial Priority:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=6
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Property Type vs. Initial Priority</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis and y-axis labels bold using HTML tags
fig.update_layout(
xaxis_title="<b>Initial Priority</b>",
yaxis_title="<b>Property Type</b>",
)
fig.show()
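The claim above that priorities are missing for many property types can be quantified by grouping on `Property Type` and measuring the share of null `Initial Priority Description` values. A minimal sketch on toy data standing in for `Int_df_merged` (the numbers are illustrative, not from the real file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Property Type': ['Terrace', 'Terrace', 'End Terrace', 'End Terrace'],
    'Initial Priority Description': ['Appointable', np.nan, np.nan, np.nan],
})

# Percentage of rows with a missing priority, per property type
missing_pct = (df['Initial Priority Description']
               .isna()
               .groupby(df['Property Type'])
               .mean() * 100)
print(missing_pct['End Terrace'])  # -> 100.0
```

Sorting `missing_pct` descending would give a ranked list of the property types most affected by unrecorded priorities.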
Property Type vs. Trade Description:ΒΆ
Investigate if there are certain linkages between certain types of properties and the repair task that needs to be carried out.
1- "Terrace", "End Terrace", "Access Direct", and "Access via internal shared area" properties are more prone to repair tasks such as "Gas Repairs", "Carpentry", "Plumbing", and "Electric Repairs".
2- In general, "Terrace" and "End Terrace" properties have the most repair requests.
ct_prop_type_trade_desc = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['TRADE_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_prop_type_trade_desc.div(ct_prop_type_trade_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_prop_type_trade_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Property Type:</b> %{y}<br>"
"<b>Trade Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=7
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Property Type vs. Trade Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Trade Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Property Type', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Property Type vs. Abandon Reason Description:ΒΆ
Investigate if there are any type of properties that are more prone to abandonment.
1- Here we can see that primarily the "Terrace", "End Terrace", "Access Direct", and "Access via internal shared area" properties are being abandoned for repair service, in that order.
2- "Access via internal shared area" and "Semi Detached" properties are abandoned on a much lower scale compared to the other property types mentioned.
3- The primary reasons for abandoning the service are "No work required", "Alternative Job", "No Access", "Duplicate Order", and "Tenant Missed Apt".
4- "Tenant Refusal" and "Input Error" are other reasons for task abandonment, though on a very low scale.
5- Interestingly, we see that "Access via internal shared area" jobs are also being abandoned due to wrong contractor assignment (94).
ct_prop_type_abandon_reason_desc = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['ABANDON_REASON_DESC'])
# Calculate percentages
ct_percentages = ct_prop_type_abandon_reason_desc.div(ct_prop_type_abandon_reason_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_prop_type_abandon_reason_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Property Type:</b> %{y}<br>"
"<b>Abandon Reason Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=7
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Property Type vs. Abandon Reason Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Abandon Reason Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Property Type', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Property Type vs. Job Status Description:ΒΆ
Analyze the relationship between the property type and the status of repair jobs.
1- We can see that "Invoice Accepted" is the dominant job status for "Terrace", "End Terrace", "Access Direct", and "Access via internal shared area" properties.
2- This seems predominantly due to the fact that these properties have a high volume of repair requests.
3- These four property types are also abandoned the most for repair service.
ct_prop_type_job_status = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['JOB_STATUS_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_prop_type_job_status.div(ct_prop_type_job_status.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_prop_type_job_status, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b> Property Type:</b> %{y}<br>"
"<b>Job Status:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=10
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Property Type vs. Job Status</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Job Status', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Property Type', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Mgt Area vs. Abandon Reason Description:ΒΆ
Explore the distribution of abandon reasons across different management areas, and see whether work abandonments are concentrated in any specific management area dealing with contractors, or whether there is no such pattern.
1- We can see that "MA1" is an outlier, in that most properties under it are being abandoned by contractors for the various reasons mentioned below.
2- Most importantly, Mgt Area "MA1" is abandoning property repair service requests primarily for the reasons "No work required", "Alternative Job", "No Access", "Duplicate Order", and "Tenant Missed Apt".
3- We need to prioritise the focus on "MA1" to understand the reasons for abandonment and improve the service level.
4- We also need to understand why a disproportionately high number of requests are being routed through "MA1". This will allow for potential optimal allocation of resources, skilled resource augmentation, etc.
ct_mgt_abandon_desc = pd.crosstab(Int_df_merged['Mgt Area'], Int_df_merged['ABANDON_REASON_DESC'])
# Calculate percentages
ct_percentages = ct_mgt_abandon_desc.div(ct_mgt_abandon_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_mgt_abandon_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Mgt Area:</b> %{y}<br>"
"<b>Abandon Reason Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=5
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Mgt Area vs. Abandon Reason Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Abandon Reason Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Mgt Area', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
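Whether abandon reasons are genuinely associated with management area, rather than just reflecting MA1's higher volume, can be checked with a chi-square test of independence on the same crosstab (the notebook already uses chi-square contingency analysis for missingness, so this is the same tool applied here). A minimal sketch on made-up counts standing in for the real `Mgt Area` × `ABANDON_REASON_DESC` table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: MA1 skews toward "No Access", MA2 toward "Duplicate Order"
df = pd.DataFrame({
    'Mgt Area': ['MA1'] * 6 + ['MA2'] * 6,
    'ABANDON_REASON_DESC': (['No Access'] * 5 + ['Duplicate Order']
                            + ['No Access'] + ['Duplicate Order'] * 5),
})
ct = pd.crosstab(df['Mgt Area'], df['ABANDON_REASON_DESC'])

# A small p-value suggests abandon reason is not independent of Mgt Area
chi2, p, dof, expected = chi2_contingency(ct)
print(dof)  # -> 1
```

On the real data the table is larger than 2x2, so some cells may have small expected counts; collapsing rare reasons into an "Other" category before testing is a common precaution.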
Mgt Area vs. Trade Description:ΒΆ
Explore the distribution of repair trades across different management areas.
1- We know from earlier that repair trades such as "Gas Repairs", "Carpentry", "Plumbing", and "Electric Repairs" are the trades abandoned the most in "Terrace" and "End Terrace" properties.
2- We see here that Mgt Area "MA1" is far more heavily engaged with these skilled trades than the other management areas.
3- This calls for further analysis to understand whether MA1 is overloaded with bulk requests, and whether more workload balancing is required, in the form of optimal resource allocation or skill augmentation with more skilled contractors.
ct_mgt_trade_desc = pd.crosstab(Int_df_merged['Mgt Area'], Int_df_merged['TRADE_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_mgt_trade_desc.div(ct_mgt_trade_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_mgt_trade_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Mgt Area:</b> %{y}<br>"
"<b>Trade Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=5
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Mgt Area vs. Trade Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Trade Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Mgt Area', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Trade Description vs. Abandon Reason Description:ΒΆ
Examine the reasons for abandoning repair jobs within specific trade categories.
1- "Gas Repair", "Carpentry", "Plumbing", and "Electric Repairs" jobs are being abandoned predominantly, in that order.
2- The primary reasons are "Alternative Job", "No Access", "No Work Required", and "Duplicate Order"; "Tenant Missed Apt" and "Tenant Refusal" are other reasons, but on a much lower scale in comparison.
ct_trade_desc_abandon = pd.crosstab(Int_df_merged['TRADE_DESCRIPTION'], Int_df_merged['ABANDON_REASON_DESC'])
# Calculate percentages
ct_percentages = ct_trade_desc_abandon.div(ct_trade_desc_abandon.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_trade_desc_abandon, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Trade Description:</b> %{y}<br>"
"<b>Abandon Reason Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Trade Description vs. Abandon Reason Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Abandon Reason Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Trade Description', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Initial Priority Description vs. Job Status Description:ΒΆ
Explore the relationship between the initial priority of repairs and their current status.
1- Repair tasks logged with "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB" priorities have the most "Invoice Accepted" statuses, with "Appointable" leading the list.
2- Understandably, these priority types ("Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB") also lead the list of abandoned tasks (1645, 502, and 457 tasks abandoned, respectively).
3- Notably, although these priority types occur in large numbers, they seem largely unresolved, as reflected in their very small "Work Completed" counts (41, 29, and 18, respectively).
ct_ip_jobstatus_desc = pd.crosstab(Int_df_merged['Initial Priority Description'], Int_df_merged['JOB_STATUS_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_ip_jobstatus_desc.div(ct_ip_jobstatus_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_ip_jobstatus_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Initial Priority Description:</b> %{y}<br>"
"<b>Job Status Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Initial Priority Description vs. Job Status Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Job Status Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Initial Priority Description', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Latest Priority Description vs. Job Status Description:ΒΆ
Explore the relationship between the latest priority of repairs and their current status.
1- Repair tasks whose final priorities are "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB" have the most "Invoice Accepted" statuses, with "Appointable" leading the list.
2- Understandably, these priority types ("Appointable", "Urgent PFI Evolve RD Irvine EMB", and "Emergency") also lead the list of abandoned tasks (2331, 543, and 533 tasks abandoned, respectively).
3- Notably, although the "Appointable" and "Emergency" priority jobs occur in large numbers, they seem largely unresolved, as reflected in their very small "Work Completed" counts (61 and 46, respectively).
4- We can notice a gradual increase in the number of jobs registered as "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB", possibly due to the accumulation of incomplete jobs on top of newly added requests.
ct_lp_jobstatus_desc = pd.crosstab(Int_df_merged['Latest Priority Description'], Int_df_merged['JOB_STATUS_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_lp_jobstatus_desc.div(ct_lp_jobstatus_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_lp_jobstatus_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Latest Priority Description:</b> %{y}<br>"
"<b>Job Status Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Latest Priority Description vs. Job Status Description</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Job Status Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Latest Priority Description', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Property Type vs. Contractor:ΒΆ
1- We can see here that although 30 contractors in total are being requisitioned for various repair jobs, only a very few (4: contractors 27, 16, 5, and 29) are predominantly tasked with "Terrace" properties, which have the most repair requests.
2- Similarly, very few contractors (7) are being utilised for "End Terrace" properties.
3- Most contractors are deployed for only a few repair tasks (e.g. contractors 13, 2, 29, 21, and others).
4- This could be because many contractors lack the necessary expertise for certain job roles such as gas repairs, carpentry, plumbing, and electric repairs, or because they are being underutilised. This needs to be investigated further.
5- This calls for either augmenting the contractor resource pool with suitably skilled agencies, optimising the allocation of existing resources in case of underutilisation, or enhancing existing contractors' skills so they can serve "Terrace" properties. This also needs further investigation.
6- Interestingly, we can see that almost all service requests for "Terrace", "End Terrace", "Semi Detached", "Access Direct", and "Access via internal shared area" properties are being routed through a contractor without any ID ("N/A").
7- This seems potentially suspicious; it could also be a data entry error or an anonymised value, which needs to be investigated.
ct_proptyp_cont_desc = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['CONTRACTOR'])
# Calculate percentages
ct_percentages = ct_proptyp_cont_desc.div(ct_proptyp_cont_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_proptyp_cont_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Property Type:</b> %{y}<br>"
"<b>Contractor:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Property Type vs. Contractor</b>",
title_x=0.5,
)
# Enable zoom and pan options
fig.update_layout(
scene=dict(
aspectmode="manual",
aspectratio=dict(x=1, y=1, z=1),
),
margin=dict(r=10, t=25, b=40, l=60),
hovermode="closest",
scene_camera=dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25),
),
uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Contractor', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Property Type', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Job Type vs. Job Status:
Analyze how different job types correspond to different job statuses. This can help in understanding the completion status of different types of repair jobs.
1- As we can see, "Responsive Repairs" and "Gas Responsive Repairs" are the dominant job types with a large number of "Invoice Accepted" jobs; they are also the most frequently abandoned jobs in the list, with fewer completions in comparison to their "Invoice Accepted" counts.
ct_jobtype_and_status = pd.crosstab(Int_df_merged['JOB_TYPE_DESCRIPTION'], Int_df_merged['JOB_STATUS_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_jobtype_and_status.div(ct_jobtype_and_status.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_jobtype_and_status, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Job Type:</b> %{y}<br>"
"<b>Job Status:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Job Type vs. Job Status Description</b>",
title_x=0.5,
)
# Layout tweaks: tight margins, closest-point hovering, and preserved UI state
# (scene/scene_camera settings apply only to 3D plots, so they are omitted here)
fig.update_layout(
    margin=dict(r=10, t=25, b=40, l=60),
    hovermode="closest",
    uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Job Status', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Job Type', title_font=dict(size=18, family='Arial', color='black'))
# Manually set y-axis tick values
fig.update_yaxes(tickvals=list(range(len(ct_jobtype_and_status.index))), ticktext=ct_jobtype_and_status.index)
# Show the plot
fig.show()
Contractor vs. Abandon Reason Description:
Investigate whether there is any pattern in the abandon reasons given by contractors. This can highlight areas where certain contractors may need additional support or training.
1- We can now see that roughly 10 contractors (out of 30 in total) have abandoned properties because they were inaccessible, resulting in a pile-up of request backlogs with a large number of incomplete jobs.
2- This narrows the scope to checking why "Terrace" and "End Terrace" type properties are inaccessible, whether due to contractors' skill levels or other potential reasons.
3- As confirmed earlier, the bulk of the abandonments are by the "N/A" contractor engaged in the skilled trades, as found from the descriptions earlier.
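A follow-up check for these findings is the abandonment rate per contractor, taking a non-null `ABANDON_REASON_DESC` as the abandonment marker. A minimal sketch on toy data standing in for `Int_df_merged`:

```python
import pandas as pd

# Toy stand-in for Int_df_merged; a non-null ABANDON_REASON_DESC marks an abandoned job
df = pd.DataFrame({
    'CONTRACTOR': ['N/A', 'N/A', 'C001', 'C002'],
    'ABANDON_REASON_DESC': ['No access', 'No access', None, 'No access'],
})

# Abandonment rate (%) per contractor, highest first
abandon_rate = (
    df['ABANDON_REASON_DESC'].notna()
      .groupby(df['CONTRACTOR'])
      .mean()
      .mul(100)
      .sort_values(ascending=False)
)
print(abandon_rate)
```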
ct_cont_abandon_desc = pd.crosstab(Int_df_merged['CONTRACTOR'], Int_df_merged['ABANDON_REASON_DESC'])
# Calculate percentages
ct_percentages = ct_cont_abandon_desc.div(ct_cont_abandon_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_cont_abandon_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Contractor:</b> %{y}<br>"
"<b>Abandon Reason Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=8
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Contractor vs. Abandon Reason Desc</b>",
title_x=0.5,
)
# Layout tweaks: tight margins, closest-point hovering, and preserved UI state
# (scene/scene_camera settings apply only to 3D plots, so they are omitted here)
fig.update_layout(
    margin=dict(r=10, t=25, b=40, l=60),
    hovermode="closest",
    uirevision="same",
)
# Style the y-axis title (Contractor)
fig.update_yaxes(title_text='Contractor', title_font=dict(size=18, family='Arial', color='black'))
# Style the x-axis title (Abandon Reason Desc)
fig.update_xaxes(title_text='Abandon Reason Desc', title_font=dict(size=18, family='Arial', color='black'))
# Show the plot
fig.show()
Contractor vs. Job Type Description:
Investigate how job types are distributed across contractors. This can highlight job types lacking contractor coverage and areas where certain contractors may need additional support or training.
1- As we know from the univariate analysis, the predominant job types are "Responsive Repairs" and "Gas Responsive Repairs", yet only a couple of contractors are tasked with "Responsive Repairs" and no named contractor is assigned to handle "Gas Responsive Repairs".
2- This finding potentially highlights the need to augment the skills of the existing contractor pool, either through training or by replacing the existing pool with skilled staff.
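The coverage gap described in point 1 can be checked directly by counting distinct named contractors per job type. A sketch on made-up data (the real frame is `Int_df_merged`):

```python
import pandas as pd

# Toy stand-in for Int_df_merged; values are illustrative only
df = pd.DataFrame({
    'JOB_TYPE_DESCRIPTION': ['Responsive Repairs', 'Responsive Repairs',
                             'Gas Responsive Repairs', 'Voids'],
    'CONTRACTOR': ['C001', 'C002', 'N/A', 'C001'],
})

# Distinct named contractors per job type ('N/A' excluded) exposes coverage gaps
coverage = (
    df[df['CONTRACTOR'] != 'N/A']
      .groupby('JOB_TYPE_DESCRIPTION')['CONTRACTOR']
      .nunique()
      .reindex(df['JOB_TYPE_DESCRIPTION'].unique(), fill_value=0)
)
print(coverage)
```

A count of 0 flags a job type handled only by the anonymous "N/A" contractor.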
ct_cont_jobtype_desc = pd.crosstab(Int_df_merged['CONTRACTOR'], Int_df_merged['JOB_TYPE_DESCRIPTION'])
# Calculate percentages
ct_percentages = ct_cont_jobtype_desc.div(ct_cont_jobtype_desc.sum(axis=1), axis=0) * 100
# Plotting
fig = px.imshow(ct_cont_jobtype_desc, text_auto=".2f", color_continuous_scale='Blues', aspect="auto")
# Update hover template to include percentages
hover_template = (
"<b>Contractor:</b> %{y}<br>"
"<b>Job Type Description:</b> %{x}<br>"
"<b>Count:</b> %{z}<br>"
"<b>Percentage:</b> %{customdata:.2f}%"
)
fig.update_traces(
hovertemplate=hover_template,
customdata=ct_percentages.values,
textfont_size=9
)
# Increase the size of the graph
fig.update_layout(
width=900,
height=600,
font=dict(size=10),
title_text="<b>Contractor vs. Job Type Description</b>",
title_x=0.5,
)
# Layout tweaks: tight margins, closest-point hovering, and preserved UI state
# (scene/scene_camera settings apply only to 3D plots, so they are omitted here)
fig.update_layout(
    margin=dict(r=10, t=25, b=40, l=60),
    hovermode="closest",
    uirevision="same",
)
# Make x-axis label bold
fig.update_xaxes(title_text='Job Type Description', title_font=dict(size=18, family='Arial', color='black'))
# Make y-axis label bold
fig.update_yaxes(title_text='Contractor', title_font=dict(size=18, family='Arial', color='black'))
# Manually set y-axis tick values
fig.update_yaxes(tickvals=list(range(len(ct_cont_jobtype_desc.index))), ticktext=ct_cont_jobtype_desc.index)
# Show the plot, with scroll-wheel zoom enabled
fig.show(config={'scrollZoom': True})
Service Type Description vs. Job Status:
Investigate whether any specific area is more prone to work abandonment. This allows us to focus on that area by diverting an optimal number of resources to it.
Job Type vs. Job Status:
Analyze how different job types correspond to different job statuses. This can help in understanding the completion status of different types of repair jobs.
Property Type vs. Initial Priority Description:
Explore the relationship between property types and the initial priority assigned to repair jobs. This can provide insights into the urgency of repairs for different property types.
cross_tab_job_status = pd.crosstab(Int_df_merged['JOB_TYPE_DESCRIPTION'], Int_df_merged['JOB_STATUS_DESCRIPTION'])
# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(cross_tab_job_status, annot=True, cmap='YlGnBu', fmt='d', linewidths=.5, cbar_kws={'label': 'Count'})
# Customize the plot
plt.title('Job Type vs. Job Status - Heatmap', fontsize=14, weight='bold')
plt.xlabel('Job Status')
plt.ylabel('Job Type')
plt.xticks(rotation=45, ha='right')
plt.show()
cross_tab_priority = pd.crosstab(Int_df_merged['Property Type'], Int_df_merged['Initial Priority Description'])
# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(cross_tab_priority, annot=True, cmap='YlGnBu', fmt='d', linewidths=.5, cbar_kws={'label': 'Count'})
# Customize the plot
plt.title('Property Type vs. Initial Priority Description - Heatmap',fontsize=14, weight='bold')
plt.xlabel('Initial Priority Description')
plt.ylabel('Property Type')
plt.xticks(rotation=45, ha='right')
plt.show()
print(Int_df_merged['Year_comp_log'].unique())
[2022 2023]
# Create a dictionary to store counts for each column
unique_counts = {}
# Iterate over columns and get unique value counts
for column in Int_df_merged.columns:
unique_counts[column] = Int_df_merged[column].nunique()
# Convert the dictionary to a DataFrame for better display
unique_counts_df = pd.DataFrame(list(unique_counts.items()), columns=['Column', 'Unique Counts'])
# Print or display the DataFrame
print(unique_counts_df)
| Column | Unique Counts |
|---|---|
| Job No | 21286 |
| Job Type | 44 |
| JOB_TYPE_DESCRIPTION | 44 |
| CONTRACTOR | 33 |
| Year of Build Date | 36 |
| Jobsourcedescription | 15 |
| Property Ref | 2078 |
| Property Type | 10 |
| Initial Priority | 27 |
| Initial Priority Description | 31 |
| Job Status | 6 |
| LATEST_PRIORITY | 27 |
| ABANDON_REASON_CODE | 20 |
| Day of Date Logged | 7 |
| SOR_CODE | 1073 |
| SOR_DESCRIPTION | 1063 |
| Date Logged | 548 |
| Mgt Area | 3 |
| TRADE_DESCRIPTION | 31 |
| Date Comp | 544 |
| Total Value | 3306 |
| ABANDON_REASON_DESC | 19 |
| JOB_STATUS_DESCRIPTION | 6 |
| Latest Priority Description | 26 |
| Year_comp_log | 2 |
| Year_comp_solved | 2 |
| Month_comp_log | 12 |
| Month_comp_solved | 12 |
| Day of Date Comp | 7 |
| Day_comp_log | 31 |
# Convert date columns to datetime format
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'], errors='coerce')
# Calculate the number of days for each trade
Int_df_merged['Days Taken'] = (Int_df_merged['Date Comp'] - Int_df_merged['Date Logged']).dt.days
# Create a scatter plot
fig = px.scatter(Int_df_merged, x='TRADE_DESCRIPTION', y='Days Taken',
title='<b>Scatter Plot-Number of Days Taken by Each Trade for Completion</b>',
labels={'TRADE_DESCRIPTION': '<b>Trade Description</b>', 'Days Taken': '<b>Number of Days</b>'},
hover_data=['Date Logged', 'Date Comp'])
# Customize the layout if needed
fig.update_layout(title_x=0.5)
# Show the plot
fig.show()
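Before interpreting the scatter plot, it is worth sanity-checking 'Days Taken': a missing completion date yields NaN, while a completion date recorded before the log date yields a negative value. A minimal reproduction on made-up dates:

```python
import pandas as pd

# Toy stand-in: one clean row, one with a completion date before the log date,
# one with no completion date (the real columns live in Int_df_merged)
df = pd.DataFrame({
    'Date Logged': pd.to_datetime(['2022-01-10', '2022-01-10', '2022-01-10']),
    'Date Comp':   pd.to_datetime(['2022-01-15', '2022-01-05', None]),
})
df['Days Taken'] = (df['Date Comp'] - df['Date Logged']).dt.days

# Sanity checks: missing completion dates give NaN, reversed dates give negatives
n_missing = df['Days Taken'].isna().sum()
n_negative = (df['Days Taken'] < 0).sum()
print(n_missing, n_negative)
```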
print(Int_df_merged[['Date Logged', 'Date Comp']].isnull().sum())
# Check missing value count for each column
missing_values = Int_df_merged.isnull().sum()
# Display the missing value count for each column
print("Missing Value Count for Each Column:")
print(missing_values)
print(Int_df_merged.info())
Date Logged 0 Date Comp 790 dtype: int64 Missing Value Count for Each Column: Job No 0 Job Type 0 JOB_TYPE_DESCRIPTION 0 CONTRACTOR 0 Year of Build Date 0 Jobsourcedescription 0 Property Ref 0 Property Type 0 Initial Priority 0 Initial Priority Description 0 Job Status 0 LATEST_PRIORITY 0 ABANDON_REASON_CODE 0 Day of Date Logged 0 SOR_CODE 0 SOR_DESCRIPTION 0 Date Logged 0 Mgt Area 0 TRADE_DESCRIPTION 0 Date Comp 790 Total Value 0 ABANDON_REASON_DESC 17253 JOB_STATUS_DESCRIPTION 0 Latest Priority Description 0 Year_comp_log 0 Year_comp_solved 790 Month_comp_log 0 Month_comp_solved 790 Day of Date Comp 790 Day_comp_log 0 Days Taken 790 dtype: int64 <class 'pandas.core.frame.DataFrame'> Int64Index: 21286 entries, 0 to 21285 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Job No 21286 non-null int64 1 Job Type 21286 non-null object 2 JOB_TYPE_DESCRIPTION 21286 non-null object 3 CONTRACTOR 21286 non-null object 4 Year of Build Date 21286 non-null datetime64[ns] 5 Jobsourcedescription 21286 non-null object 6 Property Ref 21286 non-null object 7 Property Type 21286 non-null object 8 Initial Priority 21286 non-null object 9 Initial Priority Description 21286 non-null object 10 Job Status 21286 non-null int64 11 LATEST_PRIORITY 21286 non-null object 12 ABANDON_REASON_CODE 21286 non-null object 13 Day of Date Logged 21286 non-null object 14 SOR_CODE 21286 non-null object 15 SOR_DESCRIPTION 21286 non-null object 16 Date Logged 21286 non-null datetime64[ns] 17 Mgt Area 21286 non-null object 18 TRADE_DESCRIPTION 21286 non-null object 19 Date Comp 20496 non-null datetime64[ns] 20 Total Value 21286 non-null float64 21 ABANDON_REASON_DESC 4033 non-null object 22 JOB_STATUS_DESCRIPTION 21286 non-null object 23 Latest Priority Description 21286 non-null object 24 Year_comp_log 21286 non-null int64 25 Year_comp_solved 20496 non-null float64 26 Month_comp_log 21286 non-null int64 27 Month_comp_solved 20496 non-null object 28 Day of 
Date Comp 20496 non-null object 29 Day_comp_log 21286 non-null int64 30 Days Taken 20496 non-null float64 dtypes: datetime64[ns](3), float64(3), int64(5), object(20) memory usage: 5.2+ MB None
Descriptive Analysis
Basic descriptive statistics of numeric variables.
1- Understand the distribution of the response variable ("Total Value") for different categories of each feature variable. Note: here all predictor variables are nominal categorical and the response variable is numeric, so this interpretation is carried out using box plots.
2- Understand the influence of the predictors on the response variable (a potentially non-linear/non-monotonic relationship between predictor and response variables due to the skewed, outlier-heavy nature of the data).
Int_df_merged[['Total Value','Year_comp_log', 'Year_comp_solved','Month_comp_log', 'Month_comp_solved','Day_comp_log','Days Taken']].describe()
| | Total Value | Year_comp_log | Year_comp_solved | Month_comp_log | Day_comp_log | Days Taken |
|---|---|---|---|---|---|---|
| count | 21286.000000 | 21286.000000 | 20496.000000 | 21286.000000 | 21286.000000 | 20496.000000 |
| mean | 166.733017 | 2022.681481 | 2022.695209 | 7.143146 | 15.559006 | 11.776786 |
| std | 643.976241 | 0.465913 | 0.460330 | 3.359103 | 8.711133 | 20.618111 |
| min | 0.000000 | 2022.000000 | 2022.000000 | 1.000000 | 1.000000 | -26.000000 |
| 25% | 0.000000 | 2022.000000 | 2022.000000 | 4.000000 | 8.000000 | 0.000000 |
| 50% | 100.000000 | 2023.000000 | 2023.000000 | 8.000000 | 16.000000 | 5.000000 |
| 75% | 109.500000 | 2023.000000 | 2023.000000 | 10.000000 | 23.000000 | 16.000000 |
| max | 22295.640000 | 2023.000000 | 2023.000000 | 12.000000 | 31.000000 | 371.000000 |
Interpretation - Basic descriptive statistics of numeric variables.
- Total Value:
Reflects financial aspects, like cost or value associated with each entry. The average value is around 166.73, with a wide range (standard deviation of 643.98). The distribution is skewed, as indicated by the difference between the mean and median (50% value is 100).
- Days Taken (Time taken for addressing the complaint):
The average time taken is approximately 11.78 days, with a wide range indicated by the high standard deviation. The negative values here come from records where the completion date precedes the logged date (likely data entry errors; the missing completion dates yield NaN rather than negative values).
The maximum time taken for some complaints is as long as 371 days, suggesting complex or delayed cases.
#########################################################################################################################
Descriptive statistics:
- Total Value: Insight: The average total value of repairs is 166.73, but the standard deviation is high (643.98), suggesting a wide range of repair costs. The presence of zero values may indicate free or non-monetary repairs.
These insights provide a preliminary understanding of the data, highlighting patterns, variability, and potential anomalies in the repair records.
#########################################################################################################################
4. Inference:
- Financial Aspect: There is considerable variance in the financial aspect of the complaints.
- Temporal Trends: The data may show monthly or seasonal trends in complaint logging, which could be explored further for patterns.
- Complaint Resolution: The time taken for resolution varies widely, with some complaints addressed very quickly and others taking a significant amount of time. The 790 records with a missing "Date Comp" have no 'Days Taken' value at all; the negative 'Days Taken' values instead come from completion dates recorded before the logged date.
predictors = ['Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'Latest Priority Description', 'Mgt Area', 'CONTRACTOR']
response = 'Total Value'
# Create box plots for each predictor against the response variable
for predictor in predictors:
plt.figure(figsize=(12, 6))
sns.boxplot(x=predictor, y=response, data=Int_df_merged)
plt.xticks(rotation=90)
plt.title(f'Box plot of {response} vs {predictor}', fontweight ="bold")
plt.show()
# Int_df_copy
df = Int_df_merged.copy()
predictors = ['Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'Latest Priority Description', 'Mgt Area', 'CONTRACTOR']
response = 'Total Value'
# Create interactive box plots using Plotly
for predictor in predictors:
fig = go.Figure()
# Add the main box plot trace with blue boxes
fig.add_trace(go.Box(x=df[predictor], y=df[response], boxpoints='all', name='Box Plot', marker=dict(color='blue')))
# Define logic to identify suspected outliers above the upper whisker
upper_whisker = df[response].quantile(0.75) + 1.5 * (df[response].quantile(0.75) - df[response].quantile(0.25))
suspected_outliers = df[df[response] > upper_whisker]
# Add a separate scatter trace for suspected outliers above the whisker in red
fig.add_trace(go.Box(
x=suspected_outliers[predictor],
y=suspected_outliers[response],
name="Suspected Outliers",
boxpoints='suspectedoutliers', # only suspected outliers
marker=dict(
color='rgb(8,81,156)',
outliercolor='rgba(219, 64, 82, 0.6)',
line=dict(
outliercolor='rgba(219, 64, 82, 0.6)',
outlierwidth=2)),
line_color='rgb(8,81,156)'
))
fig.update_xaxes(categoryorder='total ascending')
fig.update_layout(
xaxis_title=predictor,
yaxis_title=response,
title=dict(
text=f'<b>Box plot - {response} vs {predictor}</b>',
x=0.5,
y=0.95,
xanchor='center',
yanchor='top'
)
)
fig.show()
Box Plot Interpretation (Understanding the distribution of the response variable ("Total Value") for different categories of each feature variable)
Total Repair Value vs Property Type
The box plot of total value versus property type provides a concise summary of the variability and central tendency of costs across different types of properties.
The medians, spreads, and presence of outliers vary significantly between categories. Some property types have a wide range of values with many outliers indicating sporadic high costs, while others show more consistency with fewer outliers.
The 'Block, No Shared Area' has the most uniform costs with a narrow interquartile range, whereas 'Detached' and 'Semi Detached' properties exhibit a number of high-cost outliers. This suggests that costs are not uniform across property types and that certain types may be more prone to variable and higher expenses.
####################################################################################################################
Total Value vs Initial Priority Description
This indicates 'Total Value' varies significantly across 'Initial Priority Descriptions', with 'Emergency' categories showing the highest variability and outliers, indicating sporadically high costs.
The median values for most categories are skewed towards the lower end, suggesting that lower values are more common.
Categories with longer time frames, indicated by the number of 'Calendar Days', generally show lower ranges and variances in total value. Certain categories, like those for 'Health and Safety' and 'Compliance', have slightly higher median values, suggesting these issues tend to have a higher associated cost.
####################################################################################################################
Total Value vs Job Source Description
It indicates a wide variation in repair costs among different job order sources, with some showing high-cost outliers, reflecting a diverse range of job complexities.
Sources named "Total (Mobile App)" and "OneMobile App" stand out with higher median repair costs and greater variability, hinting at a tendency for more expensive repairs from these sources.
In contrast, sources such as "CSC Phone call", "Via Website", and "CSC E-Mail" exhibit a more consistent and lower range of costs, suggesting these channels typically handle less costly or more uniform jobs.
The distribution of costs across sources could inform strategic decisions in resource planning and financial forecasting within the maintenance sector.
####################################################################################################################
Total Value vs Job Status Description
There are three categories: 'Invoice Accepted', 'Abandoned', and 'Work Completed'.
'Invoice Accepted' has a very large range of values with numerous high-value outliers, indicating significant variability in costs associated with jobs where invoices have been accepted.
'Abandoned' jobs have a very narrow range of costs, suggesting these jobs typically involve lower and more consistent expenses. 'Work Completed' has a modest range of values with a few outliers, indicating a moderate variance in the total value of completed jobs.
This visual suggests that jobs with accepted invoices may potentially be more complex or expensive compared to those that are completed or abandoned.
####################################################################################################################
Total Value vs Trade Description
It shows a wide array of trades, each with varying costs associated with them.
Some trades, such as 'Drainage Works' and 'Out of Hours Work', show a higher range of 'Total Value' with several outliers indicating occasional high-cost jobs. Most trades have their median values toward the lower end of the scale, suggesting that lower-cost jobs are more typical within each trade.
Trades like 'Electrical', 'Roofing', and 'Gas Repairs' show a particularly wide spread in values, indicating greater variability in job costs.
On the other end, trades such as 'Inspection', 'Water', and 'Fire' have very few outliers and a narrower interquartile range, reflecting more consistency in the costs of jobs within these categories.
####################################################################################################################
Total Value vs Latest Priority Description
Jobs with an 'Emergency' priority have a broad range of total values and numerous high-value outliers, suggesting significant cost variability in emergency responses.
The 'Urgent' categories also exhibit a wide range in values but with fewer outliers than 'Emergency', indicating somewhat less variability. Categories based on calendar days display generally lower total values, with '7 Calendar Days - Health and Safety' and '28 Calendar Days - Compliance' having slightly higher medians within this group, suggesting these specific priorities may involve more costly jobs. The plot overall indicates that the urgency associated with a job correlates with an increase in the range and median of its total value.
Total Repair Value - Statistical Summary
Count (21,286): This indicates that there are 21,286 records of repair jobs in the dataset. It's a substantial number, suggesting a significant volume of repair work handled.
Mean (~166.73): On average, the repair jobs cost about 166.73 units (currency not specified). This average gives a general idea about the typical cost of repairs, but it can be influenced by extremely high or low values.
Standard Deviation (~643.98): The high standard deviation suggests a wide variation in the repair costs. This indicates that while many repairs might be around the average cost, there are quite a few that are significantly lower or higher in cost.
Minimum (0): The minimum value being 0 suggests that there are some records where no cost was associated with the repair. This could indicate warranty work, pro bono services, or data entry errors.
25th Percentile (0): 25% of the repairs cost 0 units or less. Again, this supports the presence of a significant number of repairs with no associated cost.
Median (50th Percentile, 100): The median value being 100 units means that half of the repairs cost 100 units or less. The median being lower than the mean suggests a right-skewed distribution, where a smaller number of high-cost repairs are pulling the average up.
75th Percentile (109.50): 75% of the repair jobs cost 109.50 units or less. This further indicates that most repair costs are clustered at the lower end of the spectrum.
Maximum (22,295.64): The maximum value is significantly higher than the mean and median, highlighting that there are some extremely high-cost repairs. These could be outlier cases involving complex, extensive, or emergency repair work.
Interpretations:
Pricing Strategy: The data suggests that the business typically handles low to moderately priced repairs, but it is also capable of handling a few high-cost repairs. This might influence pricing strategies and marketing.
Customer Segmentation: The wide range of repair costs might indicate a diverse customer base with varying needs β from minor fixes to major overhauls.
Resource Allocation: Knowing that most repairs are of lower cost, the business might focus resources on efficiently handling these, while also being prepared for occasional high-cost repairs.
Potential for Up-selling: Given that a significant number of repairs have zero cost, there might be an opportunity to up-sell additional services or warranties.
Review of High-Cost Repairs: Analyzing the reasons behind the high-cost repairs could help in improving cost-efficiency or identifying areas requiring special expertise or equipment.
In summary, the "Total Repair Value" data offers valuable insights for strategic planning, resource allocation, and customer service strategies in the repair business.
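Given the right skew noted above, robust statistics (median, IQR, share of zero-cost jobs) summarise typical costs better than the mean. An illustrative sketch on a small made-up series (the notebook's column is `new_df['Total_Repair_Value']`):

```python
import pandas as pd

# Illustrative skewed cost series; values are made up to mimic the real distribution
costs = pd.Series([0, 0, 100, 100, 109.5, 267.12, 22295.64])

summary = {
    'median': costs.median(),
    'iqr': costs.quantile(0.75) - costs.quantile(0.25),
    'pct_zero': costs.eq(0).mean() * 100,   # share of zero-cost jobs
    'mean': costs.mean(),                   # pulled up by the extreme value
}
print(summary)
```

Comparing the mean against the median makes the influence of the extreme values immediately visible.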
# Display basic statistics
# print(Int_df_merged.describe())
new_df = Int_df_merged[['Year of Build Date','Date Logged', 'Date Comp', 'Days Taken', 'Total Value']].copy()
# Renaming columns in the new DataFrame as specified
new_df.rename(columns={'Date Comp': 'Date_Comp_Solved', 'Days Taken': 'Days_Taken_to_Repair', 'Total Value': 'Total_Repair_Value'}, inplace=True)
# Extract year, month, and day from 'Date Comp' (assuming it is the same as 'Date_Comp_Solved')
new_df['Year_Comp_Solved'] = new_df['Date_Comp_Solved'].dt.year
new_df['Month_Comp_Solved'] = new_df['Date_Comp_Solved'].dt.month
new_df['Day_Comp_Solved'] = new_df['Date_Comp_Solved'].dt.day
# Now, create 'Year_Comp_Logged', 'Month_Comp_Logged', 'Day_Comp_Logged'
# Assuming 'Day of Date Logged' is a datetime column:
new_df['Year_Comp_Logged'] = new_df['Date Logged'].dt.year
new_df['Month_Comp_Logged'] = new_df['Date Logged'].dt.month
new_df['Day_Comp_Logged'] = new_df['Date Logged'].dt.day
# Displaying the first few rows of the new DataFrame
# new_df.head()
# Int_df_merged.info()
new_df['Total_Repair_Value']
0 100.00
1 267.12
2 88.45
3 36.63
4 100.00
...
21281 100.00
21282 0.00
21283 100.00
21284 0.00
21285 278.00
Name: Total_Repair_Value, Length: 21286, dtype: float64
new_df['Total_Repair_Value'].describe()
count 21286.000000 mean 166.733017 std 643.976241 min 0.000000 25% 0.000000 50% 100.000000 75% 109.500000 max 22295.640000 Name: Total_Repair_Value, dtype: float64
new_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 21286 entries, 0 to 21285 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year of Build Date 21286 non-null datetime64[ns] 1 Date Logged 21286 non-null datetime64[ns] 2 Date_Comp_Solved 20496 non-null datetime64[ns] 3 Days_Taken_to_Repair 20496 non-null float64 4 Total_Repair_Value 21286 non-null float64 5 Year_Comp_Solved 20496 non-null float64 6 Month_Comp_Solved 20496 non-null float64 7 Day_Comp_Solved 20496 non-null float64 8 Year_Comp_Logged 21286 non-null int64 9 Month_Comp_Logged 21286 non-null int64 10 Day_Comp_Logged 21286 non-null int64 dtypes: datetime64[ns](3), float64(5), int64(3) memory usage: 1.9 MB
Data Skewness Check - Overlay Histogram and Skewness Metrics, Boxplots
# Compute skewness for the numeric repair-timeline and cost columns
selected_columns = ['Days_Taken_to_Repair', 'Total_Repair_Value']
skewness = new_df[selected_columns].skew()
# Printing the skewness values
print(skewness)
Days_Taken_to_Repair 5.476948 Total_Repair_Value 15.855181 dtype: float64
# Create a subplot figure with 3 rows and 2 columns, adjust row heights for spacing
fig = make_subplots(rows=3, cols=2, subplot_titles=("Days_Taken_to_Repair",
"Total_Repair_Value",
"Year_Comp_Logged and Year_Comp_Solved",
"Month_Comp_Logged and Month_Comp_Solved",
"Day_Comp_Logged and Day_Comp_Solved"),
row_heights=[0.4, 0.4, 0.4],
vertical_spacing=0.15,
horizontal_spacing=0.15)
# Add traces to the subplots
fig.add_trace(go.Histogram(x=new_df['Days_Taken_to_Repair'], name='Days_Taken_to_Repair', opacity=0.75), row=1, col=1)
fig.add_trace(go.Histogram(x=new_df['Total_Repair_Value'], name='Total_Repair_Value', opacity=0.75), row=1, col=2)
fig.add_trace(go.Histogram(x=new_df['Year_Comp_Logged'], name='Year_Comp_Logged', opacity=0.75), row=2, col=1)
fig.add_trace(go.Histogram(x=new_df['Year_Comp_Solved'], name='Year_Comp_Solved', opacity=0.75), row=2, col=1) # Same cell as Year_Comp_Logged
fig.add_trace(go.Histogram(x=new_df['Month_Comp_Logged'], name='Month_Comp_Logged', opacity=0.75), row=2, col=2)
fig.add_trace(go.Histogram(x=new_df['Month_Comp_Solved'], name='Month_Comp_Solved', opacity=0.75), row=2, col=2) # Same cell as Month_Comp_Logged
fig.add_trace(go.Histogram(x=new_df['Day_Comp_Logged'], name='Day_Comp_Logged', opacity=0.75), row=3, col=1)
fig.add_trace(go.Histogram(x=new_df['Day_Comp_Solved'], name='Day_Comp_Solved', opacity=0.75), row=3, col=1)
# Update layout for overlay mode in histograms
fig.update_layout(barmode='overlay', title_text="<b>Histograms for different Repair Timelines and Value</b>", width=1200, height=1200, title_x=0.5)
# Update x-axis for year histograms to display integer values
fig.update_xaxes(tickmode='array', tickvals=[2022, 2023], row=2, col=1)
fig.show()
fig = go.Figure()
fig.add_trace(go.Box(
y=new_df['Total_Repair_Value'], # Specify the column for the y-axis
name="Total Repair Value",
boxpoints='outliers', # only outliers
marker_color='rgb(107,174,214)',
line_color='rgb(107,174,214)'
))
fig.update_layout(title_text="<b>Box Plot- Total Repair Value</b>", title_x=0.5)
fig.show()
Data Skewness Interpretation
The skewness values in the dataset provide insights into the distribution of various attributes relevant to the business.
Days_Taken_to_Repair (5.476948):ΒΆ
A high positive skewness indicates a . Most repair jobs are completed relatively quickly, but there are a significant number of jobs that take much longer than average. #######################################################################################################
Total_Repair_Value(Skewness = 15.228578):ΒΆ
1- This extremely high skewness value indicates a very right-skewed distribution. This suggests that most repair costs are low, but there are rare instances of very high repair costs; these outliers significantly influence the mean, pulling it towards higher values.
2- This is consistent with a few outlier jobs significantly driving up the average cost, as previously noted in the descriptive statistics.
3- The 'Total Repair Value' variable thus shows a highly skewed distribution, with a significant difference between the mean and median and a very large maximum value relative to the mean and standard deviation.
Days Taken to Repair (Skewness = 5.476948):ΒΆ
- This high positive skewness value suggests a significantly right-skewed distribution with a long tail to the right. Most repairs are likely completed within a shorter timeframe, but there are some cases where repairs take an exceptionally long time.
- These cases are the outliers that cause the long right tail in the distribution.
- The skewness suggests that while most repair tasks are completed relatively quickly, there are a few instances where the repair time is substantially longer, pulling the average repair time higher.
- This could mean that while most repairs are straightforward, a few are particularly complex or face delays.
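As a sanity check, pandas' `Series.skew()` reproduces this kind of value; a minimal sketch on synthetic right-skewed data (the exponential distribution and 12-day scale are illustrative assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Right-skewed sample: most repairs finish quickly, a few take very long
days = pd.Series(rng.exponential(scale=12, size=1000))
print(f"skewness: {days.skew():.2f}")  # positive value => long right tail
```

A skewness well above zero, as here, corresponds to the long right tail described above; a symmetric distribution would give a value near 0.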
Avg no of days to complete repair for different tradesΒΆ
1- We can see below that "Play and Recreation", "Water", and "Inspection" are outlier trades, taking much longer to complete repairs than the other trades: on average 71, 70.6, and 65.9 days respectively.
2- This explains the high positive skewness (long right tail) observed earlier in the data.
3- It signifies that most repair jobs are completed relatively quickly, but a significant number of jobs take much longer than average.
4- This could mean that while most repairs are straightforward, these few trades are particularly complex or face delays.
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'], errors='coerce')
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'], errors='coerce')
# Calculate the number of days for each trade
Int_df_merged['Days Taken'] = (Int_df_merged['Date Comp'] - Int_df_merged['Date Logged']).dt.days
# Calculate the average number of days for each trade and round to one decimal place
avg_days_per_trade = Int_df_merged.groupby('TRADE_DESCRIPTION')['Days Taken'].mean().reset_index()
avg_days_per_trade['Days Taken'] = avg_days_per_trade['Days Taken'].round(1)
# Sort the DataFrame in descending order of 'Days Taken'
avg_days_per_trade = avg_days_per_trade.sort_values('Days Taken', ascending=False)
# Create a bar plot with different colors for each bar
fig = px.bar(avg_days_per_trade, x='TRADE_DESCRIPTION', y='Days Taken',
title='<b>Bar plot - Average Number of Days Taken for Each Trade</b>',
labels={'TRADE_DESCRIPTION': '<b>Trade Description</b>', 'Days Taken': '<b>Average Number of Days</b>'},
color='TRADE_DESCRIPTION')
# Customize the layout if needed
fig.update_layout(title_x=0.5)
# Show the plot
fig.show()
Comparative Analysis of Management AreasΒΆ
avg_days_per_trade = Int_df_merged.groupby('Mgt Area')['Days Taken'].mean().reset_index()
avg_days_per_trade['Days Taken'] = avg_days_per_trade['Days Taken'].round(1)
avg_days_per_trade = avg_days_per_trade.sort_values('Days Taken', ascending=False)
# Prepare data for the second plot (count of contractors by management area)
contractor_count = Int_df_merged.groupby('Mgt Area')['CONTRACTOR'].count().reset_index()
# Prepare data for the third plot (count of jobs by management area)
jobs_count = Int_df_merged.groupby('Mgt Area')['Job No'].count().reset_index()
# Create subplots with an additional row
fig = make_subplots(rows=2, cols=2, subplot_titles=(
'<b>Avg. No. of Days Taken by Management Areas</b>',
'<b>Contractors under Each Management Area</b>',
'<b>Number of Jobs per Management Area</b>'),
vertical_spacing=0.15,
horizontal_spacing=0.15)
# Add the first plot (Avg Days) with text labels
fig.add_trace(
go.Bar(
x=avg_days_per_trade['Mgt Area'],
y=avg_days_per_trade['Days Taken'],
name='Avg Days',
text=avg_days_per_trade['Days Taken'], # Adding text labels
textposition='outside' # Positioning the text above the bars
),
row=1, col=1
)
# Add the second plot (Contractor Count)
fig.add_trace(
go.Bar(x=contractor_count['Mgt Area'], y=contractor_count['CONTRACTOR'], name='Contractor Count', text=contractor_count['CONTRACTOR'], textposition='outside'),
row=1, col=2
)
# Add the third plot (Jobs Count)
fig.add_trace(
go.Bar(x=jobs_count['Mgt Area'], y=jobs_count['Job No'], name='Jobs Count', text=jobs_count['Job No'], textposition='outside'),
row=2, col=1
)
# Update layout
fig.update_layout(title_text='<b>Comparative Analysis: Management Areas</b>', title_x=0.5,width=1000, height=800)
fig.show()
# Count the number of each trade description under each management area
trade_description_counts = Int_df_merged.groupby(['Mgt Area', 'TRADE_DESCRIPTION']).size().reset_index(name='Count')
# Pivot the data for stacked bar chart
pivot_data = trade_description_counts.pivot(index='Mgt Area', columns='TRADE_DESCRIPTION', values='Count').fillna(0)
# Create a stacked bar plot
fig = go.Figure()
# Add traces for each trade description
for trade_desc in pivot_data.columns:
fig.add_trace(go.Bar(
x=pivot_data.index,
y=pivot_data[trade_desc],
name=trade_desc
))
# Customize the layout
fig.update_layout(
barmode='stack',
title='<b>Stacked Bar Plot - Trade Descriptions per Management Area</b>',
xaxis_title='<b>Management Area</b>',
yaxis_title='<b>Count</b>',
title_x=0.5
)
# Add total count text labels on top of each stacked bar
for area in pivot_data.index:
total_count = pivot_data.loc[area].sum().astype(int) # Convert total count to integer
fig.add_annotation(
x=area, y=total_count,
text=str(total_count),
showarrow=False,
yshift=10
)
# Show the plot
fig.show()
Mgt Area vs. Trade Description:ΒΆ
Explore the distribution of repair trades across different management areas.
1- We know from earlier that repair trades like "Gas Repairs", "Carpentry", "Plumbing", and "Electric Repairs" are the dominant trades and the ones most frequently abandoned in "Terrace" and "End Terrace" type properties.
2- We see here that Mgt Area "MA1" is far more heavily engaged with these skilled trades than the other management areas.
3- This calls for further analysis to understand whether MA1 is overloaded with bulk requests, and whether more workload balancing is required in the form of optimal resource allocation or skill augmentation with additional skilled contractors.
Mgt Area vs. Abandon Reason Description:ΒΆ
Explore the distribution of abandon reasons across different management areas, and see whether work abandonments are concentrated in any specific management area dealing with contractors, or whether there is no such pattern.
1- We can see here that "MA1" is an outlier: most of the abandoned property repair requests fall under it, for the various reasons listed below.
2- Most importantly, Mgt Area "MA1" abandons property repair service requests primarily for the reasons "No work required", "Alternative Job", "No Access", "Duplicate Order", and "Tenant Missed Apt".
3- We need to prioritise the focus on "MA1" to understand the reasons for abandonment and improve the service level.
4- We also need to understand why a disproportionately high number of requests is routed through "MA1". This will allow for potential optimal allocation of resources, skilled resource augmentation, etc.
Mgt Area vs. Contractors:ΒΆ
1- We can see below that "MA1" has a significantly larger number of contractors (=20,489) than "MA2" (=759) and "MA3" (=29).
2- Despite this, "MA1" takes more days on average (=5.5 and =8.2) to complete a task than "MA2" and "MA3" respectively.
3- This is highly likely because Mgt Area "MA1" is engaged in all types of trades, including the frequently abandoned ones, and handles predominantly "Responsive Repairs" and "Gas Responsive Repairs" job types with mostly "Appointable" and "Emergency" priorities.
Unsupervised AnalysisΒΆ
Clustering Analysis:ΒΆ
Objective: To segment jobs based on job type and initial priority level, to understand how total costs are distributed.
1- This will help in knowing what types of repair jobs, by priority level, have cost the community the most over the period analysed (here 18 months), so these areas can be targeted to reduce costs.
# One-Hot Encoding for categorical variables
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
ohe_features = ohe.fit_transform(Int_df_merged[['JOB_TYPE_DESCRIPTION', 'Initial Priority']])
ohe_feature_labels = ohe.get_feature_names_out(['JOB_TYPE_DESCRIPTION', 'Initial Priority'])
# Log transformation of Total Value
Int_df_merged['Log_Total_Value'] = np.log1p(Int_df_merged['Total Value'])
# Feature Scaling (including log-transformed Total Value)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(np.concatenate((ohe_features, Int_df_merged[['Log_Total_Value']]), axis=1))
# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5) # Adjust epsilon and min_samples as needed
clusters = dbscan.fit_predict(scaled_features)
# Create a new DataFrame for clustering results
clustered_df = Int_df_merged.copy()
clustered_df['Cluster'] = clusters
# Create a dictionary to map cluster labels to cluster information
cluster_info = {}
for cluster_label in np.unique(clusters):
cluster_data = clustered_df[clustered_df['Cluster'] == cluster_label]
job_type_description = cluster_data['JOB_TYPE_DESCRIPTION'].value_counts().idxmax()
initial_priority = cluster_data['Initial Priority Description'].value_counts().idxmax()
cluster_info[cluster_label] = (job_type_description, initial_priority)
# Calculate the frequency of each cluster
cluster_counts = clustered_df['Cluster'].value_counts()
# Select the top 5 clusters
top_clusters = cluster_counts.head(5).index
# Create a mapping from original cluster labels to consecutive numbers
cluster_label_mapping = {cluster: i for i, cluster in enumerate(top_clusters)}
# Create a custom legend dictionary with "cluster-" prefix and respective colors
custom_legend = {
cluster: f"cluster-{cluster_label_mapping[cluster]}: {cluster_info[cluster][0]}, {cluster_info[cluster][1]}"
for cluster in top_clusters
}
# Create a scatter plot with the clustered DataFrame
plt.figure(figsize=(10, 6)) # Increase the figure size
# Create a list of sequential cluster numbers (0 to 4)
sequential_cluster_numbers = list(range(len(top_clusters)))
legend_handles = [] # Store legend handles to assign colors correctly
for cluster_label, cluster_number in zip(top_clusters, sequential_cluster_numbers):
cluster_data = clustered_df[clustered_df['Cluster'] == cluster_label]
color = plt.cm.Set1(cluster_label_mapping[cluster_label]) # Assign color based on mapping
scatter = plt.scatter(
cluster_data['Log_Total_Value'], [cluster_number] * len(cluster_data),
label=custom_legend[cluster_label], alpha=0.7, s=50, color=color
)
legend_handles.append(scatter) # Store the scatter plot for legend
# Set x-axis limits and labels
plt.ylim(-0.5, len(top_clusters) - 0.5)
plt.yticks(sequential_cluster_numbers, [f"cluster-{cluster_number}" for cluster_number in sequential_cluster_numbers])
# Set x-axis scale from 0 to 10
plt.xlim(0, 10)
# Set labels and title
plt.ylabel('Cluster')
plt.xlabel('Log-Transformed Total Value')
plt.title('Clustering - Repair Costs based on Job Type and Initial Priority', fontweight="bold", fontsize=18)
# Create the custom legend and move it outside of the plot
plt.legend(handles=legend_handles, title='Cluster Information', loc='upper left',
bbox_to_anchor=(1.05, 1))
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
#copy of master dataframe
int_df_copy = Int_df_merged.copy()
# Check missing value count for each column
# missing_values = Int_df_bk.apply(lambda x: (x == '') | pd.isnull(x)).sum()
missing_values = int_df_copy.apply(lambda x: (x == '') | x.isna()).sum()
# Display the missing value count for each column
print("Missing Value Count for Each Column:")
print(missing_values)
# nan_count_per_column = int_df_copy.isna().sum()
# # Count total NaN values in the entire DataFrame
# total_nan_count = int_df_copy.isna().sum().sum()
# # Print the results
# print("NaN values in each column:")
# print(nan_count_per_column)
print(int_df_copy.describe())
Missing Value Count for Each Column:
Job No 0
Job Type 0
JOB_TYPE_DESCRIPTION 0
CONTRACTOR 0
Year of Build Date 0
Jobsourcedescription 0
Property Ref 0
Property Type 0
Initial Priority 217
Initial Priority Description 3982
Job Status 0
LATEST_PRIORITY 199
ABANDON_REASON_CODE 17253
Day of Date Logged 0
SOR_CODE 207
SOR_DESCRIPTION 207
Date Logged 0
Mgt Area 0
TRADE_DESCRIPTION 198
Date Comp 790
Total Value 0
ABANDON_REASON_DESC 17253
JOB_STATUS_DESCRIPTION 0
Latest Priority Description 691
Year_comp_log 0
Year_comp_solved 790
Month_comp_log 0
Month_comp_solved 790
Day of Date Comp 790
Day_comp_log 0
Days Taken 790
Log_Total_Value 0
dtype: int64
Job No Job Status Total Value Year_comp_log \
count 2.128600e+04 21286.000000 21286.000000 21286.000000
mean 1.777860e+06 89.518933 166.733017 2022.681481
std 1.898065e+05 16.457957 643.976241 0.465913
min 1.425786e+06 1.000000 0.000000 2022.000000
25% 1.617206e+06 93.000000 0.000000 2022.000000
50% 1.774636e+06 93.000000 100.000000 2023.000000
75% 1.942391e+06 93.000000 109.500000 2023.000000
max 2.104663e+06 93.000000 22295.640000 2023.000000
Year_comp_solved Month_comp_log Day_comp_log Days Taken \
count 20496.000000 21286.000000 21286.000000 20496.000000
mean 2022.695209 7.143146 15.559006 11.776786
std 0.460330 3.359103 8.711133 20.618111
min 2022.000000 1.000000 1.000000 -26.000000
25% 2022.000000 4.000000 8.000000 0.000000
50% 2023.000000 8.000000 16.000000 5.000000
75% 2023.000000 10.000000 23.000000 16.000000
max 2023.000000 12.000000 31.000000 371.000000
Log_Total_Value
count 21286.000000
mean 3.161438
std 2.477233
min 0.000000
25% 0.000000
50% 4.615121
75% 4.705016
max 10.012191
Missing Data and Skewness Analysis (Target Variable).ΒΆ
Descriptive statistics of the 'Total Value':ΒΆ
Range and Scale:ΒΆ
The 'Total Value' ranges from 0 to 22,295.64. This wide range suggests significant variability in the housing repair costs, with some repairs being very costly and others much less so.
Mean and Median:ΒΆ
The mean ('average') repair cost is approximately 166.73, while the median (the 'middle' value) is 100. The mean is higher than the median, which indicates a right-skewed distribution. In other words, there are a few very high repair costs that are pulling the average up.
Standard Deviation:ΒΆ
The standard deviation is 643.98, which is quite high relative to the mean. This high standard deviation indicates a wide spread in the repair costs, confirming the presence of high variability in the data.
Quartiles:ΒΆ
25% of repairs cost 0. This could indicate either free repairs, waived costs, or possibly data entry errors. The 50th percentile (median) is at 100, meaning half of the repair jobs cost 100 or less. The 75th percentile is at 109.5, indicating that 75% of the repairs are 109.5 or less. This further confirms the right-skewness of the data, as the most expensive 25% of repairs significantly increase the average cost.
Outliers:ΒΆ
Given the maximum value is much higher than the mean and 75th percentile, there are likely outliers in the data. These outliers can significantly influence the mean and standard deviation, potentially leading to misleading interpretations.
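One common way to flag such outliers is Tukey's 1.5×IQR fence; a small sketch on hypothetical cost values (not the actual 'Total Value' column):

```python
import pandas as pd

# Hypothetical repair costs; in the notebook this would be the 'Total Value' column
costs = pd.Series([0, 0, 80, 100, 100, 105, 109, 110, 250, 22295.64])
q1, q3 = costs.quantile(0.25), costs.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule for the upper outlier fence
outliers = costs[costs > upper_fence]
print(outliers.tolist())  # [250.0, 22295.64]
```

With quartiles as tightly packed as in this dataset (Q1=0, Q3=109.5), the fence sits low and the extreme maximum is flagged immediately.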
Implications for Modeling:ΒΆ
The skewness of the data suggests that transforming the 'Total Value' might be beneficial for modeling. For example, applying a logarithmic transformation can sometimes help in reducing the skewness and stabilizing variance.
In random forest models, these outliers and the skewed distribution could be reasons for the high MSE values. The model might be struggling to accurately predict these extreme values.
Potential Data Issues:
The presence of a significant number of repairs with a cost of 0 needs to be investigated. If these represent missing or incorrect data, they should be handled appropriately.
In summary, the 'Total Value' data is right-skewed with a significant range and variability. This needs to be considered in the modeling approach and may require data transformations or different handling of outliers for better performance.
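The effect of the suggested logarithmic transformation can be sketched on synthetic heavy-tailed data (the lognormal parameters are illustrative assumptions; in the notebook the same `np.log1p` is applied to 'Total Value'):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed costs (lognormal parameters are assumptions, not the real data)
costs = pd.Series(rng.lognormal(mean=4.5, sigma=1.2, size=5000))
print(f"raw skew:   {costs.skew():.2f}")
print(f"log1p skew: {np.log1p(costs).skew():.2f}")  # much closer to 0
```

`log1p` (log of 1 + x) is used rather than a plain log so that zero-cost records remain defined, which matters here given the many 0 values in 'Total Value'.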
Linear Regression - Key Assumptions¶
Linear regression makes key assumptions about the underlying data and the relationship between the predictor variables and the response variable. It's important to be aware of these assumptions when interpreting the results of a linear regression model. Here are the main assumptions:¶
1- Linearity: The relationship between the predictor variables and the response variable should be approximately linear. This means that changes in the predictor variables are associated with a constant change in the response variable.
2- Independence of Residuals: The residuals (the differences between the observed and predicted values) should be independent. In other words, the value of the residual for one observation should not predict the value of the residual for another observation.
3- Homoscedasticity (Constant Variance of Residuals): The variability of the residuals should remain constant across all levels of the predictor variables. This implies that the spread of the residuals should be roughly the same for all values of the predictor variables.
4- Normality of Residuals: The residuals should be approximately normally distributed. This assumption is more critical for small sample sizes, as larger samples tend to approximate normality due to the Central Limit Theorem.
5- No Perfect Multicollinearity: The predictor variables should not be perfectly correlated with each other. High multicollinearity can lead to inflated standard errors of the coefficient estimates and make it challenging to identify the individual impact of each predictor variable.
6- No Autocorrelation of Residuals: The residuals should not show a systematic pattern over time if the data are collected over time. Autocorrelation in residuals suggests that there is some information left in the model that should be captured.
7- No Outliers or Influential Points: Outliers or influential points can strongly influence the estimated regression coefficients and the overall fit of the model. Identifying and addressing outliers is essential.
8- Linear Independence of Predictors: The predictor variables should not be perfectly correlated. Perfect multicollinearity (exact linear relationships between predictors) can cause problems in estimating the regression coefficients.
9- Additivity: The effect of a change in one predictor variable on the response variable is consistent, regardless of the values of other predictor variables.
10- Normality of Predictors (Optional): While not strictly an assumption of linear regression, normality of predictor variables can be desirable, especially in small samples, to improve the precision of parameter estimates.
It's important to note that violating some of these assumptions might not necessarily invalidate the results of a linear regression model, but it can affect the precision and reliability of the estimates. Careful diagnostics and additional techniques, such as transformation of variables or using robust regression methods, can be applied to address violations of assumptions.
Checking- Linear Regression key assumptionsΒΆ
Prior to model training- Data AssumptionsΒΆ
1- Linearity: Scatter plots between each predictor and the target variable.
2- Independence of Predictors (No Perfect Multicollinearity):
Variance Inflation Factor (VIF) threshold (<=5), usually used for numerical predictors, or the chi-square statistical test of independence (suitable in our case, as almost all our predictors are nominal).
3- Checking for Outliers or Influential Points (which can produce a non-linear monotonic relationship between predictor and response variable): Box plots.
4- Normality of the Distribution of Predictors: with histograms or Q-Q plots. Not applicable here, as we do not have numerical predictor variables.
Post model fitting - Diagnostical assumptionsΒΆ
1- Independence of Residuals & No Autocorrelation among them: using an autocorrelation (ACF) plot or the Durbin-Watson test.
2- Homoscedasticity (Constant Variance of Residuals): residual plot of residuals vs. predicted values.
3- Normality of Residuals: using a Q-Q plot.
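Two of these diagnostics can be computed without plots; a sketch using stand-in residuals (in the notebook they would be `y_test - linear_pred`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in residuals; in the notebook these would be y_test - linear_pred
residuals = rng.normal(size=500)

# Durbin-Watson statistic: values near 2.0 suggest no autocorrelation
# (statsmodels.stats.stattools.durbin_watson computes the same quantity)
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson: {dw:.2f}")

# Q-Q check without a plot: correlation of the ordered residuals with
# theoretical normal quantiles; values close to 1.0 suggest normality
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q correlation: {r:.3f}")
```

Values of the Durbin-Watson statistic well below 2 indicate positive autocorrelation, well above 2 negative autocorrelation.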
Data Linearity/Multicollinearity/Residual Plot/ Autocorrelation(ACF) TestΒΆ
# Define predictor variables and response variable
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area']
response = 'Total Value'
# One-hot encode the categorical variables
X = pd.get_dummies(int_df_copy[predictors], drop_first=True)
y = int_df_copy[response]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Predict and calculate MSE
linear_pred = linear_model.predict(X_test)
linear_mse = mean_squared_error(y_test, linear_pred)
print(f'Linear Regression MSE: {linear_mse}')
# Fit a Random Forest model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
# Predict and calculate MSE
rf_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)
print(f'Random Forest MSE: {rf_mse}')
# Compare the MSE of both models
if rf_mse < linear_mse:
print("The Random Forest model performs better, suggesting non-linear relationships.")
else:
print("No clear indication of non-linear relationships from model comparison.")
Linear Regression MSE: 5.440960066853942e+28
Random Forest MSE: 95744.3531322392
The Random Forest model performs better, suggesting non-linear relationships.
# Calculating the percentage difference
if linear_mse != 0:
percent_diff = ((linear_mse - rf_mse) / linear_mse) * 100
print(f"Random Forest MSE is {percent_diff:.2f}% lower than Linear Regression MSE.")
else:
print("Linear Regression MSE is zero, so percentage difference cannot be computed.")
Random Forest MSE is 100.00% lower than Linear Regression MSE.
Statistical Chi-Square Test of independence(to check dependency among predictor variables)ΒΆ
# List of categorical predictors
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area']
# Function to perform Chi-Square Test
def chi_square_test(df, var1, var2):
contingency_table = pd.crosstab(df[var1], df[var2])
chi2, p, dof, ex = chi2_contingency(contingency_table)
return chi2, p
# Threshold for significance
p_value_threshold = 0.05
# DataFrame to store the results
chi_square_results = []
# Performing Chi-Square Tests and populating the list
for i in range(len(predictors)):
for j in range(i+1, len(predictors)):
chi2, p = chi_square_test(int_df_copy, predictors[i], predictors[j])
relationship = 'Not Independent' if p < p_value_threshold else 'Independent'
chi_square_results.append({
'Variable 1': predictors[i],
'Variable 2': predictors[j],
'P-Value': p,
'Relationship': relationship
})
# Convert list of dictionaries to DataFrame
chi_square_results_df = pd.DataFrame(chi_square_results)
# Assuming the chi_square_results_df is already created as shown in previous steps
# Counting the number of independent and dependent pairs
independent_count = chi_square_results_df[chi_square_results_df['Relationship'] == 'Independent'].shape[0]
dependent_count = chi_square_results_df[chi_square_results_df['Relationship'] == 'Not Independent'].shape[0]
# Constructing the message
message = f"There are {independent_count} pairs of variables that are independent and {dependent_count} pairs that are likely not independent (suggesting potential multicollinearity)."
print(message)
# Display the results
chi_square_results_df
There are 1 pairs of variables that are independent and 54 pairs that are likely not independent (suggesting potential multicollinearity).
| | Variable 1 | Variable 2 | P-Value | Relationship |
|---|---|---|---|---|
| 0 | JOB_TYPE_DESCRIPTION | CONTRACTOR | 0.000000e+00 | Not Independent |
| 1 | JOB_TYPE_DESCRIPTION | Property Type | 0.000000e+00 | Not Independent |
| 2 | JOB_TYPE_DESCRIPTION | Jobsourcedescription | 0.000000e+00 | Not Independent |
| 3 | JOB_TYPE_DESCRIPTION | Initial Priority Description | 0.000000e+00 | Not Independent |
| 4 | JOB_TYPE_DESCRIPTION | Latest Priority Description | 0.000000e+00 | Not Independent |
| 5 | JOB_TYPE_DESCRIPTION | JOB_STATUS_DESCRIPTION | 0.000000e+00 | Not Independent |
| 6 | JOB_TYPE_DESCRIPTION | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 7 | JOB_TYPE_DESCRIPTION | ABANDON_REASON_DESC | 0.000000e+00 | Not Independent |
| 8 | JOB_TYPE_DESCRIPTION | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 9 | JOB_TYPE_DESCRIPTION | Mgt Area | 0.000000e+00 | Not Independent |
| 10 | CONTRACTOR | Property Type | 0.000000e+00 | Not Independent |
| 11 | CONTRACTOR | Jobsourcedescription | 0.000000e+00 | Not Independent |
| 12 | CONTRACTOR | Initial Priority Description | 0.000000e+00 | Not Independent |
| 13 | CONTRACTOR | Latest Priority Description | 0.000000e+00 | Not Independent |
| 14 | CONTRACTOR | JOB_STATUS_DESCRIPTION | 0.000000e+00 | Not Independent |
| 15 | CONTRACTOR | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 16 | CONTRACTOR | ABANDON_REASON_DESC | 0.000000e+00 | Not Independent |
| 17 | CONTRACTOR | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 18 | CONTRACTOR | Mgt Area | 0.000000e+00 | Not Independent |
| 19 | Property Type | Jobsourcedescription | 0.000000e+00 | Not Independent |
| 20 | Property Type | Initial Priority Description | 0.000000e+00 | Not Independent |
| 21 | Property Type | Latest Priority Description | 0.000000e+00 | Not Independent |
| 22 | Property Type | JOB_STATUS_DESCRIPTION | 1.301157e-198 | Not Independent |
| 23 | Property Type | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 24 | Property Type | ABANDON_REASON_DESC | 1.388700e-178 | Not Independent |
| 25 | Property Type | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 26 | Property Type | Mgt Area | 0.000000e+00 | Not Independent |
| 27 | Jobsourcedescription | Initial Priority Description | 0.000000e+00 | Not Independent |
| 28 | Jobsourcedescription | Latest Priority Description | 0.000000e+00 | Not Independent |
| 29 | Jobsourcedescription | JOB_STATUS_DESCRIPTION | 1.768120e-288 | Not Independent |
| 30 | Jobsourcedescription | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 31 | Jobsourcedescription | ABANDON_REASON_DESC | 1.324861e-212 | Not Independent |
| 32 | Jobsourcedescription | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 33 | Jobsourcedescription | Mgt Area | 0.000000e+00 | Not Independent |
| 34 | Initial Priority Description | Latest Priority Description | 0.000000e+00 | Not Independent |
| 35 | Initial Priority Description | JOB_STATUS_DESCRIPTION | 0.000000e+00 | Not Independent |
| 36 | Initial Priority Description | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 37 | Initial Priority Description | ABANDON_REASON_DESC | 0.000000e+00 | Not Independent |
| 38 | Initial Priority Description | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 39 | Initial Priority Description | Mgt Area | 1.116841e-219 | Not Independent |
| 40 | Latest Priority Description | JOB_STATUS_DESCRIPTION | 0.000000e+00 | Not Independent |
| 41 | Latest Priority Description | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 42 | Latest Priority Description | ABANDON_REASON_DESC | 0.000000e+00 | Not Independent |
| 43 | Latest Priority Description | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 44 | Latest Priority Description | Mgt Area | 3.457970e-203 | Not Independent |
| 45 | JOB_STATUS_DESCRIPTION | TRADE_DESCRIPTION | 0.000000e+00 | Not Independent |
| 46 | JOB_STATUS_DESCRIPTION | ABANDON_REASON_DESC | 1.000000e+00 | Independent |
| 47 | JOB_STATUS_DESCRIPTION | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 48 | JOB_STATUS_DESCRIPTION | Mgt Area | 1.349964e-05 | Not Independent |
| 49 | TRADE_DESCRIPTION | ABANDON_REASON_DESC | 0.000000e+00 | Not Independent |
| 50 | TRADE_DESCRIPTION | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 51 | TRADE_DESCRIPTION | Mgt Area | 0.000000e+00 | Not Independent |
| 52 | ABANDON_REASON_DESC | SOR_DESCRIPTION | 0.000000e+00 | Not Independent |
| 53 | ABANDON_REASON_DESC | Mgt Area | 2.470823e-17 | Not Independent |
| 54 | SOR_DESCRIPTION | Mgt Area | 0.000000e+00 | Not Independent |
Inference from Chi-square Feature Independence test above--ΒΆ
(Multi-collinearity test for Categorical Variables) -ΒΆ
('JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription', 'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area')
1- Interpretation: We can see above that all pairs of categorical predictor variables except one are likely not independent, indicating potential multicollinearity.
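With ~21k rows, almost any association yields p ≈ 0, so an effect-size measure such as Cramér's V (not computed in this notebook; sketched here) helps distinguish strong from weak dependence between categorical pairs:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Effect size (0..1) for association between two categorical variables."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy example: a perfectly dependent pair gives V = 1.0
a = pd.Series(["x", "x", "y", "y", "x", "y"])
b = pd.Series(["p", "p", "q", "q", "p", "q"])
print(cramers_v(a, b))  # 1.0
```

Applied to pairs like `int_df_copy['TRADE_DESCRIPTION']` and `int_df_copy['Mgt Area']`, this would rank the "Not Independent" pairs above by strength of association rather than by a saturated p-value.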
# Check unique values in each column
for column in int_df_copy.columns:
unique_values =int_df_copy[column].unique()
unique_count = int_df_copy[column].nunique()
print(f"Number of unique values in {column}: {unique_count}")
Number of unique values in Job No: 21286
Number of unique values in Job Type: 44
Number of unique values in JOB_TYPE_DESCRIPTION: 44
Number of unique values in CONTRACTOR: 33
Number of unique values in Year of Build Date: 36
Number of unique values in Jobsourcedescription: 15
Number of unique values in Property Ref: 2078
Number of unique values in Property Type: 10
Number of unique values in Initial Priority: 27
Number of unique values in Initial Priority Description: 31
Number of unique values in Job Status: 6
Number of unique values in LATEST_PRIORITY: 27
Number of unique values in ABANDON_REASON_CODE: 20
Number of unique values in Day of Date Logged: 7
Number of unique values in SOR_CODE: 1073
Number of unique values in SOR_DESCRIPTION: 1063
Number of unique values in Date Logged: 548
Number of unique values in Mgt Area: 3
Number of unique values in TRADE_DESCRIPTION: 31
Number of unique values in Date Comp: 544
Number of unique values in Total Value: 3306
Number of unique values in ABANDON_REASON_DESC: 19
Number of unique values in JOB_STATUS_DESCRIPTION: 6
Number of unique values in Latest Priority Description: 26
Number of unique values in Year_comp_log: 2
Number of unique values in Year_comp_solved: 2
Number of unique values in Month_comp_log: 12
Number of unique values in Month_comp_solved: 12
Number of unique values in Day of Date Comp: 7
Number of unique values in Day_comp_log: 31
Number of unique values in Days Taken: 177
Number of unique values in Log_Total_Value: 3306
# Filtering the relevant columns
columns_of_interest = ['JOB_TYPE_DESCRIPTION', 'Property Type', 'Total Value']
data_filtered = int_df_copy[columns_of_interest]
# Create scatter plots
for column in columns_of_interest[:-1]: # Exclude the target variable
plt.figure(figsize=(6, 4))
sns.scatterplot(x=data_filtered[column], y=data_filtered['Total Value'])
plt.title(f'Scatter Plot of {column} vs. Total Value')
plt.xlabel(column)
plt.xticks(rotation=45)
plt.ylabel('Total Value')
plt.yticks(rotation=45)
plt.show()
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Filtering the relevant columns
columns_of_interest = ['JOB_TYPE_DESCRIPTION', 'Property Type', 'Total Value']
data_filtered = int_df_copy[columns_of_interest]
# Create dummy variables for the categorical features
reduced_data_dummies = pd.get_dummies(data_filtered, columns=['JOB_TYPE_DESCRIPTION', 'Property Type'], drop_first=True)
# Add a constant to the DataFrame for VIF calculation
reduced_data_with_constant = add_constant(reduced_data_dummies)
# Initialize DataFrame to store VIF values
vif_data_reduced = pd.DataFrame()
vif_data_reduced['Feature'] = reduced_data_with_constant.columns
# Calculate VIF for each feature
vif_data_reduced['VIF'] = [variance_inflation_factor(reduced_data_with_constant.values, i) for i in range(reduced_data_with_constant.shape[1])]
# Set thresholds for accepted VIF values
vif_threshold_low = 5
vif_threshold_moderate = 10
# Identify features with high VIF values
high_vif_features = vif_data_reduced[vif_data_reduced['VIF'] > vif_threshold_low]
# Plotting the VIF values
plt.figure(figsize=(10, 6))
sns.barplot(x='VIF', y='Feature', data=vif_data_reduced)
plt.title('Variance Inflation Factor (VIF) for each Feature in Reduced Dataset')
plt.xlabel('Variance Inflation Factor')
plt.ylabel('Feature')
# Rotate y-axis labels by 45 degrees
plt.yticks(rotation=15)
# Show the plot
plt.show()
# Print VIF values below the plot with categorization
for feature, vif_value in zip(high_vif_features['Feature'], high_vif_features['VIF']):
if vif_value <= vif_threshold_low:
category = 'Low Multicollinearity'
elif vif_value <= vif_threshold_moderate:
category = 'Moderate Multicollinearity'
else:
category = 'High Multicollinearity'
print(f'{feature}: VIF={vif_value:.2f} ({category})')
const: VIF=2113.40 (High Multicollinearity)
JOB_TYPE_DESCRIPTION_Communal Area Building Safety Inspection: VIF=5.32 (Moderate Multicollinearity)
JOB_TYPE_DESCRIPTION_Communal Gas Repairs: VIF=5.53 (Moderate Multicollinearity)
JOB_TYPE_DESCRIPTION_Communal Responsive Repairs: VIF=8.70 (Moderate Multicollinearity)
JOB_TYPE_DESCRIPTION_Fire Safety Equipment Inspections: VIF=7.94 (Moderate Multicollinearity)
JOB_TYPE_DESCRIPTION_Gas Responsive Repairs: VIF=94.17 (High Multicollinearity)
JOB_TYPE_DESCRIPTION_Responsive Repairs: VIF=137.82 (High Multicollinearity)
JOB_TYPE_DESCRIPTION_Suspected Damp: VIF=21.13 (High Multicollinearity)
JOB_TYPE_DESCRIPTION_Void Repairs: VIF=13.67 (High Multicollinearity)
JOB_TYPE_DESCRIPTION_Water Hygiene Inspections: VIF=6.17 (Moderate Multicollinearity)
JOB_TYPE_DESCRIPTION_XXXXXXAsbestos Inspections: VIF=6.09 (Moderate Multicollinearity)
Property Type_Access direct: VIF=201.68 (High Multicollinearity)
Property Type_Access via internal shared area: VIF=147.93 (High Multicollinearity)
Property Type_Block No Shared Area: VIF=9.69 (Moderate Multicollinearity)
Property Type_Default: VIF=34.61 (High Multicollinearity)
Property Type_Detached: VIF=12.84 (High Multicollinearity)
Property Type_End Terrace: VIF=288.12 (High Multicollinearity)
Property Type_Semi Detached: VIF=135.95 (High Multicollinearity)
Property Type_Terrace: VIF=344.64 (High Multicollinearity)
L2 Regularization to Mitigate Multicollinearity among Predictors - Ridge Regression
# Filtering the relevant columns
columns_of_interest = ['JOB_TYPE_DESCRIPTION', 'Property Type', 'Total Value']
data_filtered = int_df_copy[columns_of_interest]
# Separate features and target variable
X = data_filtered.drop('Total Value', axis=1)
y = data_filtered['Total Value']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define preprocessing steps
numeric_features = [] # No numeric features in X
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_features = ['JOB_TYPE_DESCRIPTION', 'Property Type']
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(drop='first'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Combine preprocessing with Ridge regression in a pipeline
ridge_model = Pipeline(steps=[
('preprocessor', preprocessor),
('ridge', Ridge(alpha=1.0)) # Adjust alpha as needed
])
# Fit the model
ridge_model.fit(X_train, y_train)
# Calculate VIF on the preprocessed (one-hot encoded) design matrix
# (the ridge penalty itself does not change VIF; it only stabilises the coefficient estimates)
X_train_scaled = ridge_model.named_steps['preprocessor'].transform(X_train)
# Check the shape of X_train_scaled
print("Shape of X_train_scaled:", X_train_scaled.shape)
# If X_train_scaled is 1D, reshape it to 2D
if len(X_train_scaled.shape) == 1:
X_train_scaled = X_train_scaled.reshape(-1, 1)
# Convert a sparse matrix to a dense array for the VIF calculation
X_train_scaled_dense = X_train_scaled.toarray() if hasattr(X_train_scaled, 'toarray') else X_train_scaled
# Print the shape of X_train_scaled before creating the DataFrame
print("Shape of X_train_scaled before DataFrame conversion:", X_train_scaled_dense.shape)
# Convert X_train_scaled to a DataFrame without transposing
X_train_scaled_df = pd.DataFrame(X_train_scaled_dense, columns=ridge_model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(input_features=categorical_features))
vif_after = pd.DataFrame()
vif_after["Variable"] = X_train_scaled_df.columns
vif_after["VIF"] = [variance_inflation_factor(X_train_scaled_df.values, i) for i in range(X_train_scaled_df.shape[1])]
print("VIF After Ridge Regression:")
print(vif_after)
Shape of X_train_scaled: (17028, 51)
Shape of X_train_scaled before DataFrame conversion: (17028, 51)
VIF After Ridge Regression:
Variable VIF
0 JOB_TYPE_DESCRIPTION_Asbestos Inspection Communal 1.049136
1 JOB_TYPE_DESCRIPTION_Asbestos Inspection Reactive 3.074842
2 JOB_TYPE_DESCRIPTION_Asbestos Inspections Planned 3.464880
3 JOB_TYPE_DESCRIPTION_Asbestos Inspections Void 1.439536
4 JOB_TYPE_DESCRIPTION_Asbestos Repairs Communal 1.073457
5 JOB_TYPE_DESCRIPTION_Asbestos Repairs Planned 1.488048
6 JOB_TYPE_DESCRIPTION_Asbestos Repairs Reactive 1.244229
7 JOB_TYPE_DESCRIPTION_Asbestos Repairs Void 1.147363
8 JOB_TYPE_DESCRIPTION_Commercial Lifts Inspections 3.623333
9 JOB_TYPE_DESCRIPTION_Communal Area Building Sa... 4.219575
10 JOB_TYPE_DESCRIPTION_Communal Gas Inspections 1.051527
11 JOB_TYPE_DESCRIPTION_Communal Gas Repairs 4.143715
12 JOB_TYPE_DESCRIPTION_Communal Responsive Repairs 7.071256
13 JOB_TYPE_DESCRIPTION_Domestic Lifts Inspections 1.854700
14 JOB_TYPE_DESCRIPTION_Domestic Lifts Repairs 1.782554
15 JOB_TYPE_DESCRIPTION_Door Access Control Repai... 2.136295
16 JOB_TYPE_DESCRIPTION_Door Inspection and Repairs 1.836389
17 JOB_TYPE_DESCRIPTION_Fire Risk Repairs 1.553377
18 JOB_TYPE_DESCRIPTION_Fire Risk Repairs Planned 1.855465
19 JOB_TYPE_DESCRIPTION_Fire Safety Equipment Ins... 5.882171
20 JOB_TYPE_DESCRIPTION_Fire Safety Equipment Rep... 2.826031
21 JOB_TYPE_DESCRIPTION_Gas Exclusion 1.073471
22 JOB_TYPE_DESCRIPTION_Gas Responsive Repairs 82.466475
23 JOB_TYPE_DESCRIPTION_Gate and Barrier Repairs 1.125804
24 JOB_TYPE_DESCRIPTION_Lifts Consultants 1.859335
25 JOB_TYPE_DESCRIPTION_Lightning Conductors and ... 1.028323
26 JOB_TYPE_DESCRIPTION_PAT Testing 1.491652
27 JOB_TYPE_DESCRIPTION_Play Equipment Inspections 1.063840
28 JOB_TYPE_DESCRIPTION_Play Equipment Repairs 1.255359
29 JOB_TYPE_DESCRIPTION_Pre-Inspection 1.660888
30 JOB_TYPE_DESCRIPTION_Rechargeable Repairs 2.587467
31 JOB_TYPE_DESCRIPTION_Responsive Repairs 262.218850
32 JOB_TYPE_DESCRIPTION_Schedule Repairs Visit 1.176209
33 JOB_TYPE_DESCRIPTION_Section 11 Repairs 2.123408
34 JOB_TYPE_DESCRIPTION_Suspected Damp 15.866800
35 JOB_TYPE_DESCRIPTION_Tenant Doing Own Repair 1.122102
36 JOB_TYPE_DESCRIPTION_Void Repairs 9.937705
37 JOB_TYPE_DESCRIPTION_Warden Call Equipment Rep... 1.497905
38 JOB_TYPE_DESCRIPTION_Water Hygiene Inspections 4.878579
39 JOB_TYPE_DESCRIPTION_Water Risk Inspection 1.206107
40 JOB_TYPE_DESCRIPTION_XXXXXAsbestos Repairs 3.367809
41 JOB_TYPE_DESCRIPTION_XXXXXXAsbestos Inspections 4.832181
42 Property Type_Access direct 65.102819
43 Property Type_Access via internal shared area 46.849289
44 Property Type_Block No Shared Area 3.491441
45 Property Type_Default 10.875865
46 Property Type_Detached 4.232997
47 Property Type_End Terrace 105.008867
48 Property Type_Other Non-Rentable Space 2.106712
49 Property Type_Semi Detached 42.822193
50 Property Type_Terrace 144.167964
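The cells in this section fix `alpha=1.0` with a comment to "adjust as needed"; one way to do that adjustment is cross-validation over a grid of penalties. The sketch below is a minimal, hedged illustration using scikit-learn's `RidgeCV` on synthetic stand-in data (`X_demo` and `y_demo` are invented for the example; in the notebook the real `X_train`/`y_train` would be used).

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the notebook's two categorical predictors and numeric target
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.choice(['Responsive Repairs', 'Void Repairs', 'Suspected Damp'], size=200),
    rng.choice(['Terrace', 'Detached', 'Default'], size=200),
])
y_demo = rng.gamma(shape=2.0, scale=100.0, size=200)  # right-skewed, like repair costs

model = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first')),
    ('ridge', RidgeCV(alphas=np.logspace(-2, 3, 20))),  # CV over a grid instead of a fixed alpha=1.0
])
model.fit(X_demo, y_demo)
print('Selected alpha:', model.named_steps['ridge'].alpha_)
```

`RidgeCV` uses efficient leave-one-out cross-validation by default, so scanning a wide log-spaced grid is cheap compared with a full `GridSearchCV`.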
# Filtering the relevant columns
columns_of_interest = ['JOB_TYPE_DESCRIPTION', 'Property Type', 'Total Value']
data_filtered = int_df_copy[columns_of_interest]
# Separate features and target variable
X = data_filtered.drop('Total Value', axis=1)
y = data_filtered['Total Value']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define preprocessing steps
numeric_features = [] # No numeric features in X
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_features = ['JOB_TYPE_DESCRIPTION', 'Property Type']
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(drop='first'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Combine preprocessing with Ridge regression in a pipeline
ridge_model = Pipeline(steps=[
('preprocessor', preprocessor),
('ridge', Ridge(alpha=1.0)) # Adjust alpha as needed
])
# Fit the model
ridge_model.fit(X_train, y_train)
# Calculate VIF on the preprocessed (one-hot encoded) design matrix
X_train_scaled = ridge_model.named_steps['preprocessor'].transform(X_train)
# If X_train_scaled is 1D, reshape it to 2D
if len(X_train_scaled.shape) == 1:
X_train_scaled = X_train_scaled.reshape(-1, 1)
# Convert a sparse matrix to a dense array for the VIF calculation
X_train_scaled_dense = X_train_scaled.toarray() if hasattr(X_train_scaled, 'toarray') else X_train_scaled
# Print the shape of X_train_scaled before creating the DataFrame
print("Shape of X_train_scaled before DataFrame conversion:", X_train_scaled_dense.shape)
# Convert X_train_scaled to a DataFrame without transposing
X_train_scaled_df = pd.DataFrame(X_train_scaled_dense, columns=ridge_model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(input_features=categorical_features))
# Function to calculate VIF
def calculate_vif(data_frame):
vif_data = pd.DataFrame()
vif_data["Variable"] = data_frame.columns
vif_data["VIF"] = [variance_inflation_factor(data_frame.values, i) for i in range(data_frame.shape[1])]
return vif_data
# Check and handle multicollinearity
vif_threshold = 5 # Adjust the threshold as needed
max_vif = vif_threshold + 1 # Initialize max_vif to enter the loop
while max_vif > vif_threshold:
vif_data = calculate_vif(X_train_scaled_df)
max_vif = vif_data["VIF"].max()
if max_vif > vif_threshold:
print(f"Removing feature with high VIF: {vif_data.loc[vif_data['VIF'].idxmax()]['Variable']} (VIF: {max_vif})")
X_train_scaled_df = X_train_scaled_df.drop(vif_data.loc[vif_data['VIF'].idxmax()]['Variable'], axis=1)
else:
print("All VIF values are below the threshold.")
Shape of X_train_scaled before DataFrame conversion: (17028, 51)
Removing feature with high VIF: JOB_TYPE_DESCRIPTION_Responsive Repairs (VIF: 262.21885024598726)
All VIF values are below the threshold.
Inference (Unsuitability of Linear Regression)
- Non-monotonic (non-linear) relationships between the nominal predictors and the response variable, due to outlier data points and the uneven distribution of the repair cost across the vast majority (approx. 8) of the predictors.
- High to moderate multicollinearity among most of the nominal predictors, as confirmed by the VIF values and the Chi-square tests of statistical independence.
- Autocorrelation among residuals (ACF plot), hinting at the model's low explanatory power and at the linear regression model's inability to capture the relationship between the response and the predictors.
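The residual-autocorrelation claim above can be double-checked numerically with the Durbin-Watson statistic, where values near 2 indicate little first-order autocorrelation. The sketch below runs the diagnostic on synthetic residual series rather than the notebook's `residuals`, purely to illustrate how the statistic separates the two cases.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
white_noise = rng.normal(size=1000)          # independent residuals -> DW near 2
trending = np.cumsum(rng.normal(size=1000))  # strongly autocorrelated residuals -> DW near 0

print('white noise DW:', round(durbin_watson(white_noise), 3))
print('trending DW:', round(durbin_watson(trending), 3))
```

In the notebook, `durbin_watson(residuals)` on the Ridge residuals would give a single-number complement to the ACF plot.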
# Import necessary libraries
import statsmodels.api as sm
# Build the Ridge Regression Model
ridge_model.fit(X_train, y_train)
y_pred_train = ridge_model.predict(X_train)
# Get Residuals
residuals = y_train - ridge_model.predict(X_train)
# Independence (Residual Analysis): Autocorrelation Function (ACF) Plot
acf_values, conf_int = sm.tsa.acf(residuals, nlags=40, alpha=0.05)
# Plot ACF
plt.stem(range(len(acf_values)), acf_values)  # 'use_line_collection' is the default behaviour and deprecated as an argument since Matplotlib 3.6
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.title('Autocorrelation Function (ACF) Plot', fontweight ="bold", fontsize = 16)
plt.show()
# Print ACF values
print("ACF Values at Different Lags:")
for lag, acf_value in enumerate(acf_values):
print(f"Lag {lag}: {acf_value}")
#Independence (Residual Analysis):
# sm.graphics.tsa.plot_acf(residuals, lags=40, title='Autocorrelation Function (ACF)')
# plt.show()
#Homoscedasticity
plt.scatter(ridge_model.predict(X_train), residuals)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values', fontweight ="bold", fontsize = 16)
plt.axhline(y=0, color='r', linestyle='--')
plt.show()
#Normality of Residuals (Quartile-Quartile Plot):
sm.graphics.qqplot(residuals, line='45', fit=True)
plt.title('Quantile-Quantile Plot of Residuals',fontweight ="bold", fontsize = 16)
plt.show()
# Calculate R-squared
r_squared = r2_score(y_train, y_pred_train)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
# Print R-squared and RMSE
print(f"R-squared: {r_squared:.4f}")
print(f"RMSE: {rmse:.4f}")
# Plotting the regression line
plt.scatter(ridge_model.predict(X_train), residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values with Regression Line', fontweight ="bold", fontsize = 16)
plt.show()
# Plotting the best-fit line
plt.scatter(y_train, y_pred_train)
plt.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='red', linestyle='--', linewidth=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values with Best-Fit Line',fontweight ="bold", fontsize = 16)
plt.show()
ACF Values at Different Lags:
Lag 0: 1.0
Lag 1: 0.0069407418098141925
Lag 2: 0.00798015199009071
Lag 3: -0.0026370214109830404
Lag 4: -0.006616718064611427
Lag 5: -0.0018778463626996412
Lag 6: -0.004026817480861413
Lag 7: 0.010068283430081934
Lag 8: -0.001747789200518679
Lag 9: -0.0018649998182010475
Lag 10: -0.007089760573655907
Lag 11: 0.015675278069523197
Lag 12: 0.0030133311961603503
Lag 13: 0.017034144563842956
Lag 14: 0.011191286696376061
Lag 15: -0.014524684722626685
Lag 16: -0.002240533070826717
Lag 17: 0.008816849474779148
Lag 18: -0.0002630124978918425
Lag 19: 0.013697512059962643
Lag 20: 0.02024607356315158
Lag 21: -0.0022966733239824424
Lag 22: -0.010518351276925755
Lag 23: -0.0010217216151274721
Lag 24: -0.002339066251558205
Lag 25: -5.7873493254899856e-05
Lag 26: -0.0049551478581956725
Lag 27: -0.00748269332424953
Lag 28: -0.00972685531368201
Lag 29: 0.0011907664407536335
Lag 30: -0.002920357310116244
Lag 31: -0.007586310543475814
Lag 32: 0.010835265255216527
Lag 33: -0.005125472122337253
Lag 34: -0.0006042886599885133
Lag 35: 0.0037422522103615647
Lag 36: -0.0050272880526999455
Lag 37: -0.0015824926761148865
Lag 38: 0.019631086276000163
Lag 39: -0.002188306393802689
Lag 40: 0.013307672323946083
R-squared: 0.1635
RMSE: 587.2368
Inference (Ridge Regression Model)
1- The R-squared value of 0.1635 indicates that the Ridge Regression model explains approximately 16.35% of the variance in the target variable, i.e. the model fails to capture a large proportion of the variability in the data.
2- The RMSE (Root Mean Squared Error) of 587.2368 represents the average magnitude of the errors between the actual and predicted values. A lower RMSE is desirable, but its absolute interpretation depends on the scale of the target variable.
Summary
Overall, the model explains only about 16% of the variance in the data, which is not adequate.
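Since the absolute RMSE depends on the target's scale, it can be normalized to make "adequate" concrete. The sketch below backs out an implied target standard deviation from the reported R-squared and RMSE, using the usual training-set identity RMSE² = (1 − R²)·Var(y); the input numbers are the ones printed above, and the derivation is illustrative rather than part of the original pipeline.

```python
import numpy as np

rmse = 587.2368     # from the Ridge output above
r_squared = 0.1635  # from the Ridge output above

# Training-set identity: RMSE = std(y) * sqrt(1 - R^2), so std(y) can be backed out
implied_std_y = rmse / np.sqrt(1 - r_squared)
nrmse = rmse / implied_std_y  # RMSE expressed in units of target standard deviations
print(f'Implied std(y): {implied_std_y:.1f}')
print(f'Normalized RMSE: {nrmse:.3f}')  # close to 1, i.e. barely better than predicting the mean
```

A normalized RMSE around 0.91 says the model's typical error is about 91% of the spread a constant-mean predictor would leave, which matches the "not adequate" conclusion.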
Feature Importance Analysis (with "Total Value" as the Response Variable and the Others as Predictors)
The code below produces one plot for each of the four feature importance methods.
1- The Random Forest feature importance is based on impurity reduction.
2- RFE is an iterative feature selection technique: it starts with all features and eliminates the least important ones in each iteration until the desired number of features is reached; the order of elimination provides a ranking of feature importance.
3- Permutation importance is based on the impact of shuffling a feature on model accuracy.
4- The One-Way ANOVA F-value assesses the importance of each categorical variable in explaining the variance in the numerical target.
    Column                        Unique value Counts
0   Job No                        21286
1   Job Type                      44
2   JOB_TYPE_DESCRIPTION          44
3   CONTRACTOR                    33
4   Year of Build Date            36
5   Jobsourcedescription          15
6   Property Ref                  2078
7   Property Type                 10
8   Initial Priority              27
9   Initial Priority Description  31
10  Job Status                    6
11  LATEST_PRIORITY               27
12  ABANDON_REASON_CODE           20
13  Day of Date Logged            548
14  SOR_CODE                      1073
15  SOR_DESCRIPTION               1063
16  Date Logged                   548
17  Mgt Area                      3
18  TRADE_DESCRIPTION             31
19  Date Comp                     545
20  Total Value                   3306
21  ABANDON_REASON_DESC           19
22  JOB_STATUS_DESCRIPTION        6
23  Latest Priority Description   26
# Selected columns
selected_columns = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area', 'Total Value']
# Create the new DataFrame with relevant columns
data_relevant = int_df_copy[selected_columns]
data_relevant = data_relevant.dropna(subset=['Total Value'])
# Encoding categorical variables
label_encoders = {}
for column in data_relevant.select_dtypes(include=['object']).columns:
label_encoders[column] = LabelEncoder()
data_relevant[column] = label_encoders[column].fit_transform(data_relevant[column].astype(str))
# Imputing missing values
imputer = SimpleImputer(strategy='mean')
data_relevant = pd.DataFrame(imputer.fit_transform(data_relevant), columns=data_relevant.columns)
# Separating the target variable and features
X = data_relevant.drop('Total Value', axis=1)
y = data_relevant['Total Value']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Feature importance
feature_importance = model.feature_importances_
sorted_idx = np.argsort(feature_importance)
feature_names_sorted = X.columns[sorted_idx]
importance_sorted = feature_importance[sorted_idx]
# Plotting
# plt.figure(figsize=(6, 4))
# plt.barh(feature_names_sorted, importance_sorted)
# plt.xlabel('Importance')
# plt.ylabel('Feature')
# plt.title('Random Forest Regressor - Feature Importance for Predicting Total Value')
# plt.show()
plt.figure(figsize=(6, 4)) # Larger figure size
bars = plt.barh(feature_names_sorted, importance_sorted)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Regressor - Feature Importance for Predicting Total Value',fontweight ="bold", fontsize = 16)
# Adding simplified text annotations on top of each bar
for bar in bars:
if bar.get_width() > 0: # Only annotate bars wider than a threshold
plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
f'{bar.get_width():.2f}', # Rounded to two decimal places
va='center', ha='left') # Adjust text alignment if needed
plt.show()
# Create a DataFrame to display the feature names and importances
feature_importance_RF_df = pd.DataFrame({'Feature': feature_names_sorted, 'Importance': importance_sorted})
feature_importance_RF_df = feature_importance_RF_df.sort_values(by='Importance', ascending=False)
# Display the feature names and importances as a DataFrame
print("Random Forest Regressor - Feature Importance:")
print(feature_importance_RF_df)
#######################################################################################################################
# ANOVA F-value feature selection
# (note: f_classif treats the target as class labels; for a continuous target such as Total Value,
# f_regression would be the more standard choice)
f_values, p_values = f_classif(X_train, y_train)
# Sorting indices by feature importance (the trailing [:-1] drops the lowest-F feature from the ranking)
sorted_indices = np.argsort(f_values)[::-1][:-1]
# Create a DataFrame to display the feature names and ANOVA F-values
feature_importance_anova_df = pd.DataFrame({'Feature': X_train.columns[sorted_indices], 'ANOVA F-value': f_values[sorted_indices]})
# Plotting
plt.figure(figsize=(6, 4))
# ANOVA F-values
plt.barh(range(len(sorted_indices)), f_values[sorted_indices][::-1], color='skyblue') # Reverse the order
plt.yticks(range(len(sorted_indices)), X_train.columns[sorted_indices][::-1]) # Reverse the order
plt.xlabel('ANOVA F-value')
plt.ylabel('Features')
plt.title('ANOVA F-value - Feature Importance',fontweight ="bold", fontsize = 16)
# Display descending scores below the plot
for i, value in enumerate(f_values[sorted_indices][::-1]): # Reverse the order
plt.text(value, i, f'{value:.2f}', va='center', fontsize=8)
plt.tight_layout()
plt.show()
# Display the feature names and ANOVA F-values as a DataFrame
print("ANOVA F-value - Feature Importance:")
print(feature_importance_anova_df)
#######################################################################################################################
# Convert X_test to dense array for permutation importance
# X_test_dense = X_test.toarray()
# Permutation Importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=40, random_state=42)
# Permutation Importance
plt.figure(figsize=(6, 4)) # Adjust the figure size as needed
plt.subplot(1, 1, 1) # Modify subplot parameters if needed
sorted_idx_perm = perm_importance.importances_mean.argsort()
plt.barh(range(len(sorted_idx_perm)), perm_importance.importances_mean[sorted_idx_perm])
plt.yticks(range(len(sorted_idx_perm)), X_test.columns[sorted_idx_perm]) # Add y-axis labels
plt.title('Permutation Importance',fontweight ="bold", fontsize = 16)
# Display descending scores below the plot
for i, value in enumerate(perm_importance.importances_mean[sorted_idx_perm]):
plt.text(value, i, f'{value:.4f}', va='center', fontsize=8)
plt.tight_layout()
plt.show()
# Create a DataFrame to display the feature names and permutation importances
perm_importance_df = pd.DataFrame({'Feature': X_test.columns[sorted_idx_perm], 'Permutation Importance': perm_importance.importances_mean[sorted_idx_perm]})
perm_importance_df = perm_importance_df.sort_values(by='Permutation Importance', ascending=False)
# Display the feature names and permutation importances as a DataFrame
print("Permutation- Feature Importance:")
print(perm_importance_df)
#######################################################################################################################
linear_model = LinearRegression()
rfe = RFE(estimator=linear_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
# Create a DataFrame to store the feature names and rankings
rfe_ranking_df = pd.DataFrame({'Feature': X_train.columns, 'Ranking': rfe.ranking_})
# Sort the DataFrame by RFE rankings
rfe_ranking_df_sorted = rfe_ranking_df.sort_values(by='Ranking')
# Plotting
plt.figure(figsize=(10, 6)) # Adjust the figure size as needed
# Bar plot
plt.subplot(2, 1, 1) # Updated subplot to accommodate two plots
plt.barh(range(len(rfe_ranking_df_sorted)), rfe_ranking_df_sorted['Ranking'])
plt.yticks(range(len(rfe_ranking_df_sorted)), rfe_ranking_df_sorted['Feature']) # Add y-axis labels
plt.xlabel('RFE Feature Ranking (Lower is Better)')
plt.title('RFE Feature Ranking',fontweight ="bold", fontsize = 16)
# Display the rankings on the plot
for i, value in enumerate(rfe_ranking_df_sorted['Ranking']):
plt.text(value, i, f'{value}', va='center', fontsize=8)
plt.tight_layout() # Adjust layout
plt.show()
# Display the sorted DataFrame below the plot
print("\nRFE Feature Ranking DataFrame (Sorted):")
print(rfe_ranking_df_sorted)
Random Forest Regressor - Feature Importance:
Feature Importance
10 Jobsourcedescription 0.376502
9 Initial Priority Description 0.178428
8 SOR_DESCRIPTION 0.163341
7 JOB_TYPE_DESCRIPTION 0.080887
6 JOB_STATUS_DESCRIPTION 0.056166
5 TRADE_DESCRIPTION 0.052513
4 Property Type 0.047050
3 Latest Priority Description 0.023388
2 ABANDON_REASON_DESC 0.018219
1 Mgt Area 0.002255
0 CONTRACTOR 0.001251
ANOVA F-value - Feature Importance:
Feature ANOVA F-value
0 CONTRACTOR 8.740598
1 JOB_TYPE_DESCRIPTION 6.644850
2 ABANDON_REASON_DESC 2.791157
3 JOB_STATUS_DESCRIPTION 2.539255
4 TRADE_DESCRIPTION 2.509081
5 SOR_DESCRIPTION 2.155007
6 Mgt Area 1.789569
7 Latest Priority Description 1.648148
8 Property Type 1.378105
9 Jobsourcedescription 1.287345
Permutation- Feature Importance:
Feature Permutation Importance
10 Jobsourcedescription 1.312596
9 JOB_TYPE_DESCRIPTION 0.982036
8 Initial Priority Description 0.342225
7 SOR_DESCRIPTION 0.103954
6 Latest Priority Description 0.048152
5 JOB_STATUS_DESCRIPTION 0.038313
4 TRADE_DESCRIPTION 0.033952
3 ABANDON_REASON_DESC 0.022120
2 Property Type 0.021027
1 CONTRACTOR 0.001586
0 Mgt Area -0.000155
RFE Feature Ranking DataFrame (Sorted):
Feature Ranking
0 JOB_TYPE_DESCRIPTION 1
3 Jobsourcedescription 1
6 JOB_STATUS_DESCRIPTION 1
8 ABANDON_REASON_DESC 1
10 Mgt Area 1
4 Initial Priority Description 2
5 Latest Priority Description 3
1 CONTRACTOR 4
2 Property Type 5
7 TRADE_DESCRIPTION 6
9 SOR_DESCRIPTION 7
features = X.columns
rf_ranked = feature_importance_RF_df.set_index('Feature').reindex(features)['Importance'].rank(ascending=False)
anova_ranked = feature_importance_anova_df.set_index('Feature').reindex(features)['ANOVA F-value'].rank(ascending=False)
perm_ranked = perm_importance_df.set_index('Feature').reindex(features)['Permutation Importance'].rank(ascending=False)
rfe_ranked = rfe_ranking_df_sorted.set_index('Feature').reindex(features)['Ranking'].rank(ascending=True)
# Combine the rankings into one DataFrame
feature_importance_all = pd.DataFrame({
'Feature': features,
'RF': rf_ranked.values,
'ANOVA': anova_ranked.values,
'Permutation': perm_ranked.values,
'RFE': rfe_ranked.values
})
# Normalize the importance scores for better visualization
feature_importance_all_normalized = feature_importance_all.copy()
feature_importance_all_normalized.iloc[:, 1:] = feature_importance_all.iloc[:, 1:].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
# Plotting parallel coordinates
plt.figure(figsize=(12, 8))
parallel_coordinates(feature_importance_all_normalized, class_column='Feature', colormap='viridis')
plt.title('Feature Importance Comparison', fontweight ="bold", fontsize = 16)
plt.show()
feature_importance_all
| | Feature | RF | ANOVA | Permutation | RFE |
|---|---|---|---|---|---|
| 0 | JOB_TYPE_DESCRIPTION | 4.0 | 2.0 | 2.0 | 3.0 |
| 1 | CONTRACTOR | 11.0 | 1.0 | 10.0 | 8.0 |
| 2 | Property Type | 7.0 | 9.0 | 9.0 | 9.0 |
| 3 | Jobsourcedescription | 1.0 | 10.0 | 1.0 | 3.0 |
| 4 | Initial Priority Description | 2.0 | NaN | 3.0 | 6.0 |
| 5 | Latest Priority Description | 8.0 | 8.0 | 5.0 | 7.0 |
| 6 | JOB_STATUS_DESCRIPTION | 5.0 | 4.0 | 6.0 | 3.0 |
| 7 | TRADE_DESCRIPTION | 6.0 | 5.0 | 7.0 | 10.0 |
| 8 | ABANDON_REASON_DESC | 9.0 | 3.0 | 8.0 | 3.0 |
| 9 | SOR_DESCRIPTION | 3.0 | 6.0 | 4.0 | 11.0 |
| 10 | Mgt Area | 10.0 | 7.0 | 11.0 | 3.0 |
# Normalize importance scores
# (note: the RF, ANOVA and Permutation columns hold ranks where 1 = most important, so after dividing
# by the column maximum a HIGHER normalized value corresponds to a LESS important feature; only the
# RFE ranks are inverted below)
feature_importance_all['RF_normalized'] = feature_importance_all['RF'] / feature_importance_all['RF'].max()
feature_importance_all['ANOVA_normalized'] = feature_importance_all['ANOVA'].fillna(0) / feature_importance_all['ANOVA'].max()
feature_importance_all['Permutation_normalized'] = feature_importance_all['Permutation'] / feature_importance_all['Permutation'].max()
# Invert RFE ranks for normalization
rfe_max = feature_importance_all['RFE'].max() + 1
feature_importance_all['RFE_normalized'] = (rfe_max - feature_importance_all['RFE']) / rfe_max
print(feature_importance_all)
# Plotting stacked bar chart
plt.figure(figsize=(8, 6))
for i, row in feature_importance_all.iterrows():
plt.bar(row['Feature'], height=row['RF_normalized'], color='b', edgecolor='black', label='RF' if i == 0 else "")
plt.bar(row['Feature'], height=row['ANOVA_normalized'], bottom=row['RF_normalized'], color='g', edgecolor='black', label='ANOVA' if i == 0 else "")
plt.bar(row['Feature'], height=row['Permutation_normalized'], bottom=row['RF_normalized'] + row['ANOVA_normalized'], color='r', edgecolor='black', label='Permutation' if i == 0 else "")
plt.bar(row['Feature'], height=row['RFE_normalized'], bottom=row['RF_normalized'] + row['ANOVA_normalized'] + row['Permutation_normalized'], color='y', edgecolor='black', label='RFE' if i == 0 else "")
plt.xlabel('Feature')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Normalized Importance')
plt.title('Feature Importance Comparison (Stacked Bar Chart)',fontweight ="bold", fontsize = 16)
plt.legend()
plt.show()
Feature RF ANOVA Permutation RFE \
0 JOB_TYPE_DESCRIPTION 4.0 2.0 2.0 3.0
1 CONTRACTOR 11.0 1.0 10.0 8.0
2 Property Type 7.0 9.0 9.0 9.0
3 Jobsourcedescription 1.0 10.0 1.0 3.0
4 Initial Priority Description 2.0 NaN 3.0 6.0
5 Latest Priority Description 8.0 8.0 5.0 7.0
6 JOB_STATUS_DESCRIPTION 5.0 4.0 6.0 3.0
7 TRADE_DESCRIPTION 6.0 5.0 7.0 10.0
8 ABANDON_REASON_DESC 9.0 3.0 8.0 3.0
9 SOR_DESCRIPTION 3.0 6.0 4.0 11.0
10 Mgt Area 10.0 7.0 11.0 3.0
RF_normalized ANOVA_normalized Permutation_normalized RFE_normalized
0 0.363636 0.2 0.181818 0.750000
1 1.000000 0.1 0.909091 0.333333
2 0.636364 0.9 0.818182 0.250000
3 0.090909 1.0 0.090909 0.750000
4 0.181818 0.0 0.272727 0.500000
5 0.727273 0.8 0.454545 0.416667
6 0.454545 0.4 0.545455 0.750000
7 0.545455 0.5 0.636364 0.166667
8 0.818182 0.3 0.727273 0.750000
9 0.272727 0.6 0.363636 0.083333
10 0.909091 0.7 1.000000 0.750000
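Because the normalized columns above divide rank values (1 = most important) by their maximum, a higher "normalized importance" can actually mean a less important feature for RF, ANOVA and Permutation. A hedged sketch of one consistent alternative, inverting every rank column the way RFE is inverted, so that a higher score always means a more important feature (toy data, not the notebook's DataFrame):

```python
import pandas as pd

# Toy ranks (1 = most important), standing in for the table above
ranks = pd.DataFrame({
    'Feature': ['A', 'B', 'C'],
    'RF': [1.0, 3.0, 2.0],
    'ANOVA': [2.0, 1.0, 3.0],
})

n = len(ranks)
# Invert each rank column: rank 1 maps to the highest score, rank n to the lowest
for col in ['RF', 'ANOVA']:
    ranks[col + '_score'] = (n + 1 - ranks[col]) / (n + 1)

print(ranks[['Feature', 'RF_score', 'ANOVA_score']])
```

Applied to the real DataFrame, the same loop over the four rank columns would make the stacked bars directly comparable across methods.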
Ensemble Mix of Best Features (Based on Majority-Voting Consensus) across Different Feature Selection Techniques
Dash Interactive App
Step 1: Determine the consensus threshold, i.e. how many methods need to agree on a feature being important. For this example, let's say a feature is considered important if at least 3 out of the 4 methods agree.
Note - This is a Dash interactive app. Please note that to be interactive it needs to be hosted on a service provider's platform as a standalone web application.
User-Customised Selection of the Threshold (Number of Feature Selection Methods)
# Dash / Plotly imports needed by this interactive cell
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
# Normalize importance scores and invert RFE ranks
feature_importance_all['RF_normalized'] = feature_importance_all['RF'] / feature_importance_all['RF'].max()
feature_importance_all['ANOVA_normalized'] = feature_importance_all['ANOVA'].fillna(0) / feature_importance_all['ANOVA'].max()
feature_importance_all['Permutation_normalized'] = feature_importance_all['Permutation'] / feature_importance_all['Permutation'].max()
rfe_max = feature_importance_all['RFE'].max() + 1
feature_importance_all['RFE_normalized'] = (rfe_max - feature_importance_all['RFE']) / rfe_max
app = dash.Dash(__name__)
app.layout = html.Div([
html.H4("Feature Selection Technique:", style={'textAlign': 'center'}),
dcc.Dropdown(
id='feature-selection-dropdown',
options=[
{'label': 'Random Forest Regressor', 'value': 'RF_normalized'},
{'label': 'ANOVA-F Value', 'value': 'ANOVA_normalized'},
{'label': 'Permutation Importance', 'value': 'Permutation_normalized'},
{'label': 'RFE (Recursive Feature Elimination)', 'value': 'RFE_normalized'}
],
value='RF_normalized'
),
dcc.Graph(id='feature-importance-graph')
], style={'textAlign': 'center'})
@app.callback(
    Output('feature-importance-graph', 'figure'),
    [Input('feature-selection-dropdown', 'value')]
)
def update_graph(selected_technique):
    # Sort the DataFrame based on the selected technique's score in descending order
    sorted_df = feature_importance_all.sort_values(by=selected_technique, ascending=False)
    # Create the bar plot with the sorted DataFrame
    fig = px.bar(sorted_df, x='Feature', y=selected_technique, color='Feature')
    fig.update_layout(
        title='<b>Feature Importance Comparison</b>', title_x=0.5,
        xaxis_title='<b>Feature</b>',
        yaxis_title='<b>Normalized Importance</b>',
        legend_title='<b>Feature Selection Technique</b>'
    )
    return fig
if __name__ == '__main__':
    app.run_server(debug=True, port=8052)
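The RFE inversion at the top of this cell, `(rfe_max - rank) / rfe_max`, maps rank 1 (best) to the highest normalized score so that all four columns point in the same direction. A quick standalone check with hypothetical ranks:

```python
import pandas as pd

# Hypothetical RFE ranks: 1 is best, higher ranks were eliminated earlier.
ranks = pd.Series([1, 3, 2], index=['A', 'B', 'C'])
rfe_max = ranks.max() + 1                # 4
inverted = (rfe_max - ranks) / rfe_max   # rank 1 -> 0.75 (highest score)
print(inverted.to_dict())  # {'A': 0.75, 'B': 0.25, 'C': 0.5}
```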
User-customised selection of the number of feature selection techniques to be used for consensus on feature importanceΒΆ
Note - This is a Dash interactive app. To enable interaction, it needs to be hosted as a standalone web application on a service provider's platform.
consensus_threshold = 1
# Mark features as important (1) or not important (0) in each method
feature_importance_all['RF_important'] = (feature_importance_all['RF'] >= feature_importance_all['RF'].median()).astype(int)
feature_importance_all['ANOVA_important'] = (feature_importance_all['ANOVA'] >= feature_importance_all['ANOVA'].median()).astype(int)
feature_importance_all['Permutation_important'] = (feature_importance_all['Permutation'] >= feature_importance_all['Permutation'].median()).astype(int)
feature_importance_all['RFE_important'] = (feature_importance_all['RFE'] <= feature_importance_all['RFE'].median()).astype(int) # Lower rank is better
# Calculate the total number of methods that find each feature important
feature_importance_all['Total_important'] = (feature_importance_all[['RF_important', 'ANOVA_important', 'Permutation_important', 'RFE_important']].sum(axis=1))
print(feature_importance_all[['Feature', 'Total_important']])
# Select features that meet the consensus threshold
consensus_features = feature_importance_all[feature_importance_all['Total_important'] >= consensus_threshold]
# Prepare the data for plotting
consensus_feature_importance = consensus_features[['RF', 'ANOVA', 'Permutation', 'RFE']]
consensus_feature_importance = consensus_feature_importance.rename(columns={
'RF': 'Random Forest Regressor',
'ANOVA': 'Anova F-Value',
'Permutation': 'Permutation Feature Importance',
'RFE': 'Recursive Feature Elimination'
})
app = dash.Dash(__name__)
app.layout = html.Div([
html.H4("Feature Selection Consensus Threshold:", style={'textAlign': 'center'}),
dcc.Dropdown(
id='consensus-threshold-dropdown',
options=[{'label': i, 'value': i} for i in range(1, 5)],
value=2
),
dcc.Graph(id='feature-importance-graph')
], style={'textAlign': 'center'})
@app.callback(
    Output('feature-importance-graph', 'figure'),
    [Input('consensus-threshold-dropdown', 'value')]
)
def update_graph(consensus_threshold):
    consensus_features = feature_importance_all[feature_importance_all['Total_important'] >= consensus_threshold]
    consensus_feature_importance = consensus_features[['RF', 'ANOVA', 'Permutation', 'RFE']]
    consensus_feature_importance = consensus_feature_importance.rename(columns={
        'RF': 'Random Forest Regressor',
        'ANOVA': 'Anova F-Value',
        'Permutation': 'Permutation Feature Importance',
        'RFE': 'Recursive Feature Elimination'
    })
    consensus_feature_importance.index = consensus_features['Feature']
    fig = go.Figure()
    for col in consensus_feature_importance.columns:
        fig.add_trace(go.Bar(x=consensus_feature_importance.index,
                             y=consensus_feature_importance[col],
                             name=col))
    fig.update_layout(barmode='stack',
                      title='<b>Consensus Among Feature Selection Methods - Feature Importance</b>', title_x=0.5,
                      xaxis_title='<b>Feature</b>',
                      yaxis_title='<b>Importance Score</b>',
                      legend_title='<b>Feature Importance Selection Method</b>')
    return fig
if __name__ == '__main__':
    app.run_server(debug=True, port=8053)
                         Feature  Total_important
0           JOB_TYPE_DESCRIPTION                1
1                     CONTRACTOR                2
2                  Property Type                3
3           Jobsourcedescription                2
4   Initial Priority Description                1
5    Latest Priority Description                2
6         JOB_STATUS_DESCRIPTION                2
7              TRADE_DESCRIPTION                2
8            ABANDON_REASON_DESC                3
9                SOR_DESCRIPTION                1
10                      Mgt Area                4
Displaying the Feature Selection Values (Normalized and actual scores)ΒΆ
feature_importance_all
| | Feature | RF | ANOVA | Permutation | RFE | RF_normalized | ANOVA_normalized | Permutation_normalized | RFE_normalized | RF_important | ANOVA_important | Permutation_important | RFE_important | Total_important |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB_TYPE_DESCRIPTION | 4.0 | 2.0 | 2.0 | 3.0 | 0.363636 | 0.2 | 0.181818 | 0.750000 | 0 | 0 | 0 | 1 | 1 |
| 1 | CONTRACTOR | 11.0 | 1.0 | 10.0 | 8.0 | 1.000000 | 0.1 | 0.909091 | 0.333333 | 1 | 0 | 1 | 0 | 2 |
| 2 | Property Type | 7.0 | 9.0 | 9.0 | 9.0 | 0.636364 | 0.9 | 0.818182 | 0.250000 | 1 | 1 | 1 | 0 | 3 |
| 3 | Jobsourcedescription | 1.0 | 10.0 | 1.0 | 3.0 | 0.090909 | 1.0 | 0.090909 | 0.750000 | 0 | 1 | 0 | 1 | 2 |
| 4 | Initial Priority Description | 2.0 | NaN | 3.0 | 6.0 | 0.181818 | 0.0 | 0.272727 | 0.500000 | 0 | 0 | 0 | 1 | 1 |
| 5 | Latest Priority Description | 8.0 | 8.0 | 5.0 | 7.0 | 0.727273 | 0.8 | 0.454545 | 0.416667 | 1 | 1 | 0 | 0 | 2 |
| 6 | JOB_STATUS_DESCRIPTION | 5.0 | 4.0 | 6.0 | 3.0 | 0.454545 | 0.4 | 0.545455 | 0.750000 | 0 | 0 | 1 | 1 | 2 |
| 7 | TRADE_DESCRIPTION | 6.0 | 5.0 | 7.0 | 10.0 | 0.545455 | 0.5 | 0.636364 | 0.166667 | 1 | 0 | 1 | 0 | 2 |
| 8 | ABANDON_REASON_DESC | 9.0 | 3.0 | 8.0 | 3.0 | 0.818182 | 0.3 | 0.727273 | 0.750000 | 1 | 0 | 1 | 1 | 3 |
| 9 | SOR_DESCRIPTION | 3.0 | 6.0 | 4.0 | 11.0 | 0.272727 | 0.6 | 0.363636 | 0.083333 | 0 | 1 | 0 | 0 | 1 |
| 10 | Mgt Area | 10.0 | 7.0 | 11.0 | 3.0 | 0.909091 | 0.7 | 1.000000 | 0.750000 | 1 | 1 | 1 | 1 | 4 |
Analysing the Feature Selection Techniques Employed in our contextΒΆ
Analyzing the feature importance results from different methods and understanding why they might not converge requires considering the nature of each method and the characteristics of the dataset:
Analyzing Results from Different Methods:ΒΆ
Random Forest Regressor:ΒΆ
This method calculates importance based on how much each feature decreases the impurity of the split (e.g., Gini impurity for classification, variance for regression). Features that lead to more significant impurity decrease are considered more important. Here 'Jobsourcedescription' and 'Initial Priority Description' are top features.
ANOVA F-value:ΒΆ
This method is based on statistical tests to measure the impact of each feature on the variance of the response variable. It's sensitive to linear relationships. Features like 'CONTRACTOR' and 'JOB_TYPE_DESCRIPTION' showing high F-values indicate a strong linear relationship with the response variable.
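One common implementation of this test is scikit-learn's `f_regression`; a minimal sketch on synthetic numeric data (not the notebook's categorical predictors), where only one feature carries a linear signal:

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: only feature 1 has a (linear) relationship with y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

f_vals, p_vals = f_regression(X, y)  # univariate F-test per feature
print(f_vals)  # feature 1 dominates
```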
Permutation Importance:ΒΆ
This method assesses feature importance by observing the changes in model performance when the feature values are randomly shuffled. This captures the effect of the feature on the prediction accuracy. Again, 'Jobsourcedescription' and 'JOB_TYPE_DESCRIPTION' are impactful according to this method.
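A minimal sketch of this shuffling idea with scikit-learn's `permutation_importance`, on synthetic data (hypothetical, not the notebook's dataset) where only one feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where only feature 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # feature 0's score drops most when shuffled
```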
RFE (Recursive Feature Elimination):ΒΆ
RFE ranks features by recursively removing the least important features based on model performance. It gives a direct ranking of features, with 'JOB_TYPE_DESCRIPTION' and 'Jobsourcedescription' ranking high.
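The recursive elimination can be sketched in a few lines with scikit-learn's `RFE`; this uses synthetic data (hypothetical) where two of five features carry the signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data: features 0 and 3 carry the signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=200)

rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=2).fit(X, y)
print(rfe.ranking_)  # rank 1 = kept; larger ranks were eliminated earlier
```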
Why Modelling Results May Not Converge:ΒΆ
Different Metrics: Each method uses a different metric to evaluate importance. For example, Random Forest looks at impurity decrease, while ANOVA F-values are based on statistical tests. These methods may prioritize different aspects of the data.
Data Characteristics: If the dataset has non-linear relationships, interaction effects, or features that are informative only in specific contexts, different methods will capture these aspects differently.
Noise and Correlation: If the dataset contains noise or correlated features, it can affect the feature importance scores. Some methods might be more sensitive to these issues than others.
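The correlation effect is easy to demonstrate: when two features are near-duplicates, impurity-based importance gets split between them. A minimal sketch on synthetic data (hypothetical, not the notebook's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Two nearly identical features: the impurity-based importance of the
# true signal is split between them, while the noise feature stays near 0.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # near-duplicate of x1
x3 = rng.normal(size=500)                   # unrelated noise
y = 3 * x1 + rng.normal(scale=0.1, size=500)

X = np.column_stack([x1, x2, x3])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)  # importance shared between x1 and x2
```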
Model-Specific Biases: Methods like Random Forest and RFE are model-dependent and may reflect biases inherent to the model (e.g., Random Forest might favor numerical features over categorical ones).
- What It Tells Us About The Dataset: The divergence in results suggests that different features play varying roles depending on the context and the type of relationship they have with the response variable.
1- It potentially indicates that the dataset has a mix of linear and non-linear relationships. 2- The presence of interaction effects or correlated features could also be influencing the results.
Next steps post Feature selection:ΒΆ
Note - We will carry these out in the next steps with Random Forest and Gradient Boosting ensemble algorithms.
Model Training and Validation using Hyperparameter Tuning:ΒΆ
Compare the results across methods: features consistently ranked high across different methods are likely to be genuinely important.
Model-Specific Analysis:ΒΆ
Analyze feature importance within the context of the specific model you plan to use for prediction.
Model Experimental results and diagnosis:ΒΆ
Experiment with different sets of features based on these importance scores and see how they affect the model's performance on unseen data.
Further Investigation and future scope: For features with divergent importance scores, further investigation (e.g., data visualization, statistical tests) could provide insights into their relationships with the response variable to select more appropriate predictors.
Alternate Modelling Techniques (To handle data non-linearity and complex relationships)ΒΆ
Random Forest Model with (11) predictors ---Model-1ΒΆ
Hyperparameter tuning using GridSearch cross-validation to determine the optimal hyperparametersΒΆ
Best Parameters:ΒΆ
{'model__n_estimators': 300, 'model__min_samples_leaf': 1, 'model__max_features': 'auto', 'model__max_depth': 20}
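A version note (an assumption about the reader's environment, not part of the original run): `max_features='auto'` was deprecated for regressors in scikit-learn 1.1 and removed in 1.3. For `RandomForestRegressor` it was equivalent to using all features, so on newer versions the same configuration can be written as:

```python
from sklearn.ensemble import RandomForestRegressor

# On scikit-learn >= 1.3, replace max_features='auto' with the equivalent 1.0
model = RandomForestRegressor(n_estimators=300, max_depth=20,
                              min_samples_leaf=1, max_features=1.0,
                              random_state=42)
print(model.max_features)  # 1.0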
Notes - What to look for in learning curves:ΒΆ
Overfitting: If the training error is much lower than the validation error, it suggests overfitting. The model performs well on the training data but fails to generalize to unseen data.
Underfitting: If both training and validation errors are high (and converge at a high value), the model might be underfitting. It's not capturing the underlying trend in the data.
Good Fit: Ideally, both training and validation errors should converge to a low value. The gap between them should be small.
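scikit-learn also ships a built-in `learning_curve` helper that computes these curves with cross-validation, an alternative to the hand-rolled plotting function used later in this notebook. A minimal sketch on synthetic data (hypothetical, not the notebook's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=300)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=30, random_state=0), X, y,
    train_sizes=[0.3, 0.6, 1.0], cv=3, scoring='neg_mean_squared_error')
print(sizes)                       # absolute training-set sizes used
print(-train_scores.mean(axis=1))  # mean training MSE per size
print(-val_scores.mean(axis=1))    # mean validation MSE per size
```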
Model-1 (Random Forest Model with (11) predictors)ΒΆ
# Define predictor variables and response variable
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy[predictors]
y = int_df_copy[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Transform the datasets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_val_transformed = transformer.transform(X_val)
# Create a Pipeline with Random Forest model
pipeline = Pipeline([
('transformer', transformer),
('model', RandomForestRegressor(random_state=42))
])
# Full parameter grid for hyperparameter tuning (commented out to save run time)
# param_distributions = {
# 'model__n_estimators': [200, 300, 400],
# 'model__max_depth': [None, 10, 20, 30],
# 'model__min_samples_leaf': [1, 5, 10],
# 'model__max_features': ['auto', 'sqrt', 'log2']
# }
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [20],
'model__min_samples_leaf': [1],
'model__max_features': ['auto']
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold,
                           scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best model and parameters
best_model = grid_search.best_estimator_.named_steps['model']
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_model.predict(X_train_transformed)
y_test_preds = best_model.predict(X_test_transformed)
y_val_preds = best_model.predict(X_val_transformed)
# Calculate MSE for each set
train_mse = mean_squared_error(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
val_mse = mean_squared_error(y_val, y_val_preds)
# Print MSE results
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Testing MSE: {test_mse}")
# Additional evaluation metrics
# Calculate RMSE for the training/testing/validation set
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_preds))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_preds))
# Print the RMSE values
print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")
print(f"Validation RMSE: {val_rmse}")
# Calculate MAE for training/testing/validation data
train_mae = mean_absolute_error(y_train, y_train_preds)
val_mae = mean_absolute_error(y_val, y_val_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
# Calculate RΒ² for training/testing/validation data
train_r2 = r2_score(y_train, y_train_preds)
val_r2 = r2_score(y_val, y_val_preds)
test_r2 = r2_score(y_test, y_test_preds)
# Calculate MAPE for training/testing/validation data ---
# We should not use MAPE here: with 7,624 zero or near-zero values in the data,
# the MAPE calculation encounters division by zero or near-zero, leading to extremely high MAPE values.
# This makes MAPE an unreliable metric in this scenario.
# train_mape = mean_absolute_percentage_error(y_train, y_train_preds)
# test_mape = mean_absolute_percentage_error(y_test, y_test_preds)
# val_mape = mean_absolute_percentage_error(y_val, y_val_preds)
# Print MAE results
print(f"Training MAE: {train_mae}")
print(f"Validation MAE: {val_mae}")
print(f"Testing MAE: {test_mae}")
# Print R2 results
print(f"Training RΒ²: {train_r2}")
print(f"Validation RΒ²: {val_r2}")
print(f"Testing RΒ²: {test_r2}")
# Print MAPE results
# print(f"Training MAPE: {train_mape}")
# print(f"Validation MAPE: {val_mape}")
# print(f"Testing MAPE: {test_mape}")
# Residuals vs Predicted Plot
residuals = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot', fontweight='bold')
plt.show()
# Feature Importance Plot
# Fitting the OneHotEncoder separately to get the feature names
one_hot.fit(X_train[categorical_features])
feature_names = one_hot.get_feature_names_out(input_features=categorical_features)
feature_importances = best_model.feature_importances_
sorted_idx = feature_importances.argsort()
# Print sorted feature importances
print("Sorted Feature Importances:")
for idx in sorted_idx:
    print(f"{feature_names[idx]}: {feature_importances[idx]}")
plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, X_test, y_test, step=50, max_data_points=1000):
    train_errors, val_errors, test_errors = [], [], []
    # Use shape[0] to get the number of samples in the training set
    n_train_samples = min(max_data_points, X_train.shape[0])
    sizes = list(range(1, n_train_samples, step))
    for m in sizes:
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
        test_errors.append(mean_squared_error(y_test, model.predict(X_test)))
    # Plot RMSE against the actual training-set sizes used
    plt.plot(sizes, np.sqrt(train_errors), label="Train")
    plt.plot(sizes, np.sqrt(val_errors), label="Validation")
    plt.plot(sizes, np.sqrt(test_errors), label="Test")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.legend()
    plt.title("Training, Validation, and Test Loss Curves", fontweight="bold")
    plt.show()
# Using the best_model that has been fitted to the entire training dataset
plot_learning_curves(best_model, X_train_transformed, y_train, X_val_transformed, y_val, X_test_transformed, y_test, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 12.8s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.1s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 12.5s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.6s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 12.6s
Best Parameters: {'model__max_depth': 20, 'model__max_features': 'auto', 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Training MSE: 44354.144452753644
Validation MSE: 94233.42907164285
Testing MSE: 91677.01946031969
Training RMSE: 210.6042365498701
Testing RMSE: 302.78213200306203
Validation RMSE: 306.9746391343149
Training MAE: 81.95216925839884
Validation MAE: 94.95686325138155
Testing MAE: 94.22922538636342
Training RΒ²: 0.9003243914227648
Validation RΒ²: 0.7098245289558233
Testing RΒ²: 0.7474948095916253
Sorted Feature Importances: (output truncated for readability — several hundred one-hot encoded SOR_DESCRIPTION levels, nearly all with an importance of 0.0)
5 NO: 0.0 SOR_DESCRIPTION_SEVERE PENETRATION:DISCONNECT REPAIR TEST (RTR WITHIN 12HRS): 0.0 SOR_DESCRIPTION_SEALANT:RENEW TO BASIN OR SINK: 0.0 SOR_DESCRIPTION_Through floor lift: 0.0 SOR_DESCRIPTION_SHOWER:RENEW THERMOSTATIC MIXING VALVE (NOT FTF): 0.0 SOR_DESCRIPTION_Remove and dispose wall lining^/boxing not exceeding 1m?: 0.0 SOR_DESCRIPTION_Remove existing mortice lock, S & F Euro Thumbturn: 0.0 SOR_DESCRIPTION_Remove^/dispose TC concrete wall or ceiling, less than 1m?: 0.0 SOR_DESCRIPTION_Remove^/dispose TC plasterboard wall^/ceiling less than 1m? : 0.0 SOR_DESCRIPTION_Renew External door (Timber) (NOT FTF): 0.0 SOR_DESCRIPTION_Repair or replace lock to door entry system: 0.0 SOR_DESCRIPTION_Repairs Call Out Rate: Price to attend on-site Emergency Out: 0.0 SOR_DESCRIPTION_Replace WC (NOT FTF): 0.0 SOR_DESCRIPTION_Replace kitchen drawer fronts (NOT FTF): 0.0 SOR_DESCRIPTION_Replace shower (NOT FTF): 0.0 SOR_DESCRIPTION_S&F Meter Cupboard (Small): 0.0 SOR_DESCRIPTION_VENT:INSTALL TUMBLE VENT KIT (NOT FTF): 0.0 SOR_DESCRIPTION_VENT UNIT:OVERHAUL: 0.0 SOR_DESCRIPTION_VENT INSTALL NEW INT PLASTIC GRILL TO REP PLASTER(NOT FTF): 0.0 SOR_DESCRIPTION_SCREED:LATEX SELF LEVEL: 0.0 SOR_DESCRIPTION_SCREED:LAY 40MM THICK IN PATCH: 0.0 SOR_DESCRIPTION_SCREED:LAY 40MM THICK SCREED: 0.0 SOR_DESCRIPTION_SEALANT:APPLY TO WORKTOP AND WALL: 0.0 SOR_DESCRIPTION_SEALANT:PVC TRIM TO BATH: 0.0 SOR_DESCRIPTION_SHOWER:RENEW SHOWER SLIDE BAR: 0.0 SOR_DESCRIPTION_Play area: Find fault and fix (routine): 0.0 SOR_DESCRIPTION_Play Area: Quoted Work: 0.0 SOR_DESCRIPTION_Plaster patch - (per sq m) (NOT FTF): 0.0 SOR_DESCRIPTION_LATCH:OVERHAUL LATCH AND FURNITURE: 0.0 SOR_DESCRIPTION_VERGE:RENEW OR REFIX DRY VERGE RIDGE OR END STOP: 0.0 SOR_DESCRIPTION_LETTERPLATE:RENEW FIREPROOF TYPE: 0.0 SOR_DESCRIPTION_LETTERPLATE:RENEW OR SUPPLY HIGH SECURITY COWL: 0.0 SOR_DESCRIPTION_LETTERPLATE:SUPPLY AND FIX NEW: 0.0 SOR_DESCRIPTION_LIFT CONSULTANCY: LOLER INSPECTION COMMUNAL PASSENGER LIFT: 0.0 
SOR_DESCRIPTION_LIFT CONSULTANCY: LOLER INSPECTION COMMUNAL STAIRLIFT: 0.0 SOR_DESCRIPTION_LIFT CONSULTANCY: LOLER INSPECTION DOMESTIC HOIST OR SIMILAR: 0.0 SOR_DESCRIPTION_LIFT CONSULTANCY: LOLER INSPECTION DOMESTIC STAIRLIFT: 0.0 SOR_DESCRIPTION_LIFT:RENEW SASH LIFT: 0.0 SOR_DESCRIPTION_LIGHT FITTING:OVERHAUL FLUORESCENT ANY TYPE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:REMOVE AND REFIX ANY TYPE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:REMOVE REFIX ANY FLUORESCENT TYPE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:RENEW 16W LV BULKHEAD TYPE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:RENEW 2X8W BULKHEAD TYPE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:RENEW ANY SIZE FLUORESCENT TUBE: 0.0 SOR_DESCRIPTION_LIGHT FITTING:RENEW SINGLE FLUORESCENT WITH TUBE: 0.0 SOR_DESCRIPTION_LIGHT:RENEW FLEX LAMPHOLDER ROSE: 0.0 SOR_DESCRIPTION_LIGHT:TEMPORARY CONNECTION: 0.0 SOR_DESCRIPTION_LIGHTING COLUMN:OVERHAUL BOLLARD TYPE ? BALLAST: 0.0 SOR_DESCRIPTION_LINTEL:RENEW WITH CATNIC NE 2.5M LONG (NOT FTF): 0.0 SOR_DESCRIPTION_LAMP:RENEW NE 100W BULKHEAD LAMP: 0.0 SOR_DESCRIPTION_LOBBY:REDECORATE CEILINGS NE 3SM AREA (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW WALL UNIT DOOR (NOT FTF): 0.0 SOR_DESCRIPTION_SOCKET:RENEW 13A DOUBLE PLATE: 0.0 SOR_DESCRIPTION_HANDRAIL:SUPPLY PROPRIETARY TO WALL (NOT FTF): 0.0 SOR_DESCRIPTION_HARDCORE:ADDITIONAL SUB-BASE OR BED NE 150 (NOT FTF): 0.0 SOR_DESCRIPTION_HEATER SERVICE ANY TYPE OF STORAGE HEATER(RTR IN14WORKDAYS): 0.0 SOR_DESCRIPTION_HINGES:RENEW 63MM STORMPROOF: 0.0 SOR_DESCRIPTION_HOLE:HOLE FOR CAVITY INSPECTION: 0.0 SOR_DESCRIPTION_SOCKET:RENEW SINGLE OUTLET PLATE AND BOX: 0.0 SOR_DESCRIPTION_HOPPER:CLEAR OUT BLOCKED HOPPERHEAD: 0.0 SOR_DESCRIPTION_High Risk Action for Extraction: 0.0 SOR_DESCRIPTION_IMMERSION HEATER:TEST AND RESET THERMOSTAT: 0.0 SOR_DESCRIPTION_INSPECT:LANDLORDS LIGHTS NE 4 FLOORS: 0.0 SOR_DESCRIPTION_INSPECTION COVER:RENEW-300X300MM (NOT FTF): 0.0 SOR_DESCRIPTION_SOCKET:RENEW DOUBLE OUTLET PLATE AND BOX: 0.0 SOR_DESCRIPTION_SOCKET:RENEW 13A DP PLATE WITH 
INDICATOR: 0.0 SOR_DESCRIPTION_KEYSAFE:PROVIDE : 0.0 SOR_DESCRIPTION_KITCHEN UNIT:OVERHAUL ANY TYPE: 0.0 SOR_DESCRIPTION_WALL 113MM:CHEMICAL INJECTION INTERNAL DPC (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW BASE UNIT DOOR (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW DRAWER BOX COMPLETE (NOT FTF): 0.0 SOR_DESCRIPTION_VERTICAL COVERING:EXTRA TO RENEW FELT AND BATTENS (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW SHELF TO UNIT (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW SINGLE BASE 300X600 (NOT FTF): 0.0 SOR_DESCRIPTION_KITCHEN UNIT:RENEW SINGLE BASE TO MATCH EXISTING (NOT FTF): 0.0 SOR_DESCRIPTION_HANDRAIL:SUPPLY MOPSTICK INCLUDING BRACKETS (NOT FTF): 0.0 SOR_DESCRIPTION_LOBBY:REDECORATE COMPLETE NE 3SM CEILING AREA (NOT FTF): 0.0 SOR_DESCRIPTION_LOCK:FULL LOCK CHANGE ? FRONT AND REAR DOOR MULTI: 0.0 SOR_DESCRIPTION_SMOKE DETECTOR:SERVICE AND OVERHAUL HARD WIRE: 0.0 SOR_DESCRIPTION_VERGE:REFIX DRY OR CLOAKED VERGE COMPLETE: 0.0 SOR_DESCRIPTION_PATH:REMOVE AND INFILL WITH TOPSOIL (NOT FTF): 0.0 SOR_DESCRIPTION_PATH:RENEW NE 100MM CONCRETE BED AND SUB-BASE (NOT FTF): 0.0 SOR_DESCRIPTION_PAVING:REBED BRICK PAVING-MORTAR OR SAND: 0.0 SOR_DESCRIPTION_PAVING:RENEW GRAVEL PAVING 30MM (NOT FTF): 0.0 SOR_DESCRIPTION_PIPE:RENEW OR INSTALL 15MM COPPER: 0.0 SOR_DESCRIPTION_SLIDING DOOR GEAR:RENEW: 0.0 SOR_DESCRIPTION_PLASTER REPAIR:RENEW PLASTER VENT: 0.0 SOR_DESCRIPTION_PLASTER REPAIR:RENEW REVEAL TO FRAME: 0.0 SOR_DESCRIPTION_SLATE:RENEW FIBRE CEMENT NE 5 NO: 0.0 SOR_DESCRIPTION_PLASTER REPAIR:REPAIR CRACKS AROUND FRAME: 0.0 SOR_DESCRIPTION_PORCH:RENEW LEAD COVERING TO PORCH (NOT FTF): 0.0 SOR_DESCRIPTION_PORTABLE APPLIANCE:ANNUAL TEST (up to 50 items): 0.0 SOR_DESCRIPTION_POWER AND LIGHT:TEMPORARY CONNECTION: 0.0 SOR_DESCRIPTION_POWER:NEW SPUR OUTLET GROUND FLOOR NE 10M: 0.0 SOR_DESCRIPTION_POWER:RENEW RING MAIN GROUND NE 4 NO SOCKETS: 0.0 SOR_DESCRIPTION_POWER:RENEW SOCKET OUTLET-GROUND FLOOR: 0.0 SOR_DESCRIPTION_POWER:RENEW SOCKET OUTLET-UPPER FLOOR: 
0.0 SOR_DESCRIPTION_PUMP:RENEW BOOSTED SHOWER PUMP (NOT FTF): 0.0 SOR_DESCRIPTION_PUMP:SERVICE AND OVERHAUL: 0.0 SOR_DESCRIPTION_PANELLING:RENEW IN HARDBOARD (NOT FTF): 0.0 SOR_DESCRIPTION_LOCK:FULL LOCK CHANGE ? FRONT AND REAR DOOR: 0.0 SOR_DESCRIPTION_PANE:REGLAZE 7MM GWCG UPTO 1.00SM (NOT FTF): 0.0 SOR_DESCRIPTION_PANE:REGLAZE 6.4MM LAMINATED UPTO 1.00SM (NOT FTF): 0.0 SOR_DESCRIPTION_LOCK:FULL LOCK CHANGE ? FRONT DOOR: 0.0 SOR_DESCRIPTION_LOCK:RENEW EUROLOCK COMPLETE: 0.0 SOR_DESCRIPTION_LOCK:RENEW MORTICE COMPLETE: 0.0 SOR_DESCRIPTION_LOCK:RENEW MORTICE KEEP: 0.0 SOR_DESCRIPTION_LOCK:RENEW PATIO DOOR LOCK COMPLETE: 0.0 SOR_DESCRIPTION_Low Risk Action for Extraction: 0.0 SOR_DESCRIPTION_MACADAM:RENEW 70MM PAVING (NOT FTF): 0.0 SOR_DESCRIPTION_METER CUPBOARD:RENEW DOOR: 0.0 SOR_DESCRIPTION_METER CUPBOARD:RENEW OR SUPPLY AND FIX NEW (NOT FTF): 0.0 SOR_DESCRIPTION_MIXER:RENEW THERMOSTATIC TO BATH (NOT FTF): 0.0 SOR_DESCRIPTION_MIXER:RENEW THERMOSTATIC TO BATH SHOWER ATTACHMENT (NOT FTF): 0.0 SOR_DESCRIPTION_VERGE:REMOVE AND REFIX TILE: 0.0 SOR_DESCRIPTION_Medium Risk Action - Extraction Use only: 0.0 SOR_DESCRIPTION_Medium Risk Action for Extraction: 0.0 SOR_DESCRIPTION_Miscellaneous Works: 0.0 SOR_DESCRIPTION_NIGHTLATCH:RENEW COMPLETE: 0.0 SOR_DESCRIPTION_NIGHTLATCH:RENEW CYLINDER BARREL: 0.0 SOR_DESCRIPTION_NOSING:REFIX TO STEP: 0.0 SOR_DESCRIPTION_OPENING:BOARD UP WITH 12MM STERLING OR PLY(RTRWITHIN12HRS): 0.0 SOR_DESCRIPTION_OPENING:REMOVE BOARDING TO OPENINGS (RTR WITHIN 14 WORKDAYS): 0.0 SOR_DESCRIPTION_OVERHAUL^/REPAIR DEFECTIVE DOOR ENTRY INTERCOM: 0.0 SOR_DESCRIPTION_PANE:REGLAZE 6MM CLEAR OR OBSCURE UPTO 1.00SM (NOT FTF): 0.0 SOR_DESCRIPTION_DOOR:RENEW INTERNAL EMBOSSED PANELLED ? 
DECORATE (NOT FTF): 0.0 SOR_DESCRIPTION_SHOWER:RENEW CURTAIN: 0.0 SOR_DESCRIPTION_WALL:REPAIR FRACTURE: 0.0 SOR_DESCRIPTION_WATER HAMMER:CLEAR AND REMEDY AIRLOCK: 0.0 SOR_DESCRIPTION_WASTE:REPAIR LEAK ON COPPER WASTE: 0.0 JOB_TYPE_DESCRIPTION_Lifts Consultants: 0.0 SOR_DESCRIPTION_DOOR:RENEW INTERNAL EMBOSSED PANELLED (NOT FTF): 0.0 TRADE_DESCRIPTION_Water: 0.0 ABANDON_REASON_DESC_Abortive Call: 0.0 ABANDON_REASON_DESC_Added to Planned Programme: 0.0 SOR_DESCRIPTION_WATERBAR:RENEW: 0.0 SOR_DESCRIPTION_Servicing: Block: 21-40 Units - Automatic Opening Vents: 0.0 SOR_DESCRIPTION_Servicing : Other - Fire Detection System: 0.0 ABANDON_REASON_DESC_No Access: 0.0 ABANDON_REASON_DESC_No Charge: 0.0 SOR_DESCRIPTION_WASTE:RENEW 40MM PIPE AND TRAP SINK: 0.0 ABANDON_REASON_DESC_Riverside Not Approved: 0.0 ABANDON_REASON_DESC_See Repair Memo: 0.0 ABANDON_REASON_DESC_Tenant Refusal: 0.0 ABANDON_REASON_DESC_Data Clean Up: 0.0 JOB_TYPE_DESCRIPTION_Lightning Conductors and Fall Safety Rep: 0.0 JOB_TYPE_DESCRIPTION_PAT Testing: 0.0 JOB_TYPE_DESCRIPTION_Play Equipment Repairs: 0.0 TRADE_DESCRIPTION_Disabled Adaptations: 0.0 TRADE_DESCRIPTION_Door Access Control: 0.0 SOR_DESCRIPTION_Servicing: Block: 21-40 Units - Fire Detection System: 0.0 SOR_DESCRIPTION_WC SUITE:INSTALL CLOSE COUPLED SPECIAL NEEDS TYPE (NOT FTF): 0.0 SOR_DESCRIPTION_WC PAN:RENEW SEAT COMPLETE: 0.0 TRADE_DESCRIPTION_Fire: 0.0 SOR_DESCRIPTION_WC PAN:RENEW FLUSH PIPE: 0.0 SOR_DESCRIPTION_WC PAN:OVERHAUL ANY TYPE: 0.0 SOR_DESCRIPTION_Servicing: Block: 21-40 Units - Emergency Lighting: 0.0 TRADE_DESCRIPTION_Inspection: 0.0 JOB_TYPE_DESCRIPTION_Schedule Repairs Visit: 0.0 SOR_DESCRIPTION_TEMP GLAZING FIX PRIOR TO REPLACEMENT(RTRWITHIN14WORKDAYS): 0.0 SOR_DESCRIPTION_WC CISTERN:RENEW OVERFLOW: 0.0 SOR_DESCRIPTION_WC CISTERN:RENEW LOW LEVEL PLASTIC: 0.0 TRADE_DESCRIPTION_Play and Recreation: 0.0 SOR_DESCRIPTION_WC CISTERN:REFIX INCLUDING RENEW BRACKET: 0.0 SOR_DESCRIPTION_WC CISTERN:OVERHAUL ANY TYPE (RTR WITHIN 3 WORKING 
DAYS): 0.0 ABANDON_REASON_DESC_Testing: 0.0 SOR_DESCRIPTION_WC SUITE:RENEW CLOSE COUPLED: 0.0 ABANDON_REASON_DESC_Work Under Guarantee: 0.0 SOR_DESCRIPTION_WASTE:RENEW 40MM PIPE AND TRAP BATH: 0.0 SOR_DESCRIPTION_ASB Planned works -Medium Kitchen : 0.0 SOR_DESCRIPTION_ASB Planned works -Pitched Roof : 0.0 SOR_DESCRIPTION_ASBESTOS MANAGEMENT SURVEY (NOT FTF): 0.0 SOR_DESCRIPTION_ASBESTOS R & D SURVEY (NOT FTF): 0.0 SOR_DESCRIPTION_ASPHALT:MAKE GOOD CRACK OVER 1.0M: 0.0 SOR_DESCRIPTION_ASPHALT:RENEW 20MM IN PATCH NE 2.0SM (NOT FTF): 0.0 SOR_DESCRIPTION_AUTOMATIC DOORS:RESPONSE CALLOUT ( In Hours): 0.0 SOR_DESCRIPTION_ASB Planned works -Medium Kitchen: 0.0 SOR_DESCRIPTION_AUTOMATIC OPENING VENTS - FIND FAULT AND FIX: 0.0 SOR_DESCRIPTION_Annual Automattic Gate Service: 0.0 SOR_DESCRIPTION_Annual Fully Comprehensive Maintenance Cost - Hydraulic: 0.0 SOR_DESCRIPTION_Annual Fully Comprehensive Maintenance Cost - MRL ? max 4 fl: 0.0 SOR_DESCRIPTION_Annual Subscription Payment to Alarm Receiving Centre (ARC): 0.0 SOR_DESCRIPTION_Annual inspection^/service to fixed fall arrest system: 0.0 SOR_DESCRIPTION_Asb Action: Medium Risk action: 0.0 SOR_DESCRIPTION_Attend site and carry out 12M service to Access control: 0.0 SOR_DESCRIPTION_Annual Audit: 0.0 SOR_DESCRIPTION_ASB Planned works -Large Kitchen : 0.0 SOR_DESCRIPTION_ASB Planned works -GN Comm Door: 0.0 SOR_DESCRIPTION_APPLIANCE:DISCONNECT AND RECONNECT: 0.0 SOR_DESCRIPTION_ACM Re-inspection(s) Domestic: 0.0 SOR_DESCRIPTION_AIRBRICK:INSTALL NEW CLAY OR CONCRETE VENT: 0.0 SOR_DESCRIPTION_Servicing : Block: 7-20 Units - Fire Detection System: 0.0 SOR_DESCRIPTION_AIRBRICK:REBED LOOSE VENT: 0.0 SOR_DESCRIPTION_AIRBRICK:RENEW CLAY OR CONCRETE VENT: 0.0 SOR_DESCRIPTION_AIRBRICK:RENEW WITH PVC: 0.0 SOR_DESCRIPTION_AMS - Dwelling - Adaptation (per dwelling): 0.0 SOR_DESCRIPTION_AMS - Dwelling - Void (per dwelling): 0.0 SOR_DESCRIPTION_AMS with Two Targeted area- Dwelling - Adaptaion: 0.0 SOR_DESCRIPTION_AMS with Two Targeted area- 
Dwelling - Responsive Repair : 0.0 SOR_DESCRIPTION_AMS with Two Targeted area- Dwelling - Void (per dwelling): 0.0 CONTRACTOR_Contractor 31: 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - Disrepair: 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - Home improve request: 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - R R : 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - Void (per dwelling): 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - per Block : 0.0 SOR_DESCRIPTION_TEST:OCCUPIED PROPERTY POST REPAIRS CERTIFICATE: 0.0 SOR_DESCRIPTION_Attend site and carry out 6 Monthly service - Single Swing : 0.0 SOR_DESCRIPTION_WEATHERSTRIP:FIX AA TO DOOR AND FRAME: 0.0 TRADE_DESCRIPTION_Asbestos: 0.0 SOR_DESCRIPTION_WINDOW:RENEW RESTRICTOR STAY TO PVCU: 0.0 SOR_DESCRIPTION_TANK:REPAIR LEAK TO TANK OR FITTING (RTR WITHIN 3 WORK DAYS): 0.0 CONTRACTOR_Contractor 20: 0.0 SOR_DESCRIPTION_WINDOW:RENEW QUADRANT FILLET TO PVCU: 0.0 SOR_DESCRIPTION_WINDOW:RENEW LOCKING HANDLE TO PVCU LOCKING PLATE: 0.0 CONTRACTOR_Contractor 2: 0.0 SOR_DESCRIPTION_WINDOW:RENEW HANDLE TO PVCU: 0.0 SOR_DESCRIPTION_TAPS:CONVERT BASIN TO LEVERS - PAIR: 0.0 SOR_DESCRIPTION_WINDOW:RENEW GLAZING BEAD TO PVCU: 0.0 SOR_DESCRIPTION_WINDOW:RENEW ESPAGNOLETTE LOCK TO PVCU(RTRWITHIN14WORKDAYS): 0.0 CONTRACTOR_Contractor 17: 0.0 SOR_DESCRIPTION_WINDOW:RENEW CILL TO PVCU: 0.0 Initial Priority Description_112 Calendar Days - Compliance: 0.0 Initial Priority Description_12 Calendar Hours: 0.0 SOR_DESCRIPTION_TANK:REPAIR BALLVALVE AND FLOAT: 0.0 CONTRACTOR_Contractor 16: 0.0 CONTRACTOR_Contractor 18: 0.0 SOR_DESCRIPTION_WINDOW:RENEW SEALING GASKET TO PVCU: 0.0 CONTRACTOR_Contractor 25: 0.0 CONTRACTOR_Contractor 26: 0.0 CONTRACTOR_Contractor 4: 0.0 SOR_DESCRIPTION_WINDOW:RENEW WITH PVCU TILT TURN 4 LIGHT (NOT FTF): 0.0 SOR_DESCRIPTION_TAP:RENEW PAIR HIGHNECK SINK PILLAR TAPS: 0.0 SOR_DESCRIPTION_TAP:RENEW PAIR BASIN PILLAR TAPS: 0.0 CONTRACTOR_Contractor 9: 0.0 
SOR_DESCRIPTION_TAP:OVERHAUL ANY TYPE OF TAP (RTR WITHIN 3 WORKING DAYS): 0.0 SOR_DESCRIPTION_WINDOW:RENEW WITH PVCU TILT TURN 3 LIGHT (NOT FTF): 0.0 SOR_DESCRIPTION_WINDOW:RENEW WITH PVCU CASEMENT 4 LIGHT (NOT FTF): 0.0 SOR_DESCRIPTION_TAP:OVERHAUL ANY TYPE OF MIXER: 0.0 SOR_DESCRIPTION_WINDOW:RENEW WITH PVCU CASEMENT 3 LIGHT (NOT FTF): 0.0 CONTRACTOR_Contractor 29: 0.0 SOR_DESCRIPTION_TANKING:INSTALL ASPHALT DPM VERTICAL (NOT FTF): 0.0 SOR_DESCRIPTION_WINDOW:RENEW WEATHER DRAUGHT PROOFING METAL: 0.0 SOR_DESCRIPTION_WINDOW:RENEW SOFTWOOD PARTING OR STAFF BEAD: 0.0 CONTRACTOR_Contractor 28: 0.0 CONTRACTOR_Contractor 27: 0.0 SOR_DESCRIPTION_WINDOW:RENEW SET OF SASH CORDS: 0.0 Initial Priority Description_38 Calendar Days - Compliance: 0.0 SOR_DESCRIPTION_WINDOWS^/DOORS:PVCU 3 BED FLAT CHECK CLEAN: 0.0 CONTRACTOR_Contractor 14: 0.0 SOR_DESCRIPTION_TREE:DIG OUT SEEDLING UPTO 150 MM GIRTH: 0.0 SOR_DESCRIPTION_WINDOW:EASE AND ADJUST PVCU SASH (RTR WITHIN 7 WORKING DAYS): 0.0 SOR_DESCRIPTION_WINDOW:EASE AND ADJUST INCLUDING REMOVE (RTRWITHIN3WORKDAYS): 0.0 SOR_DESCRIPTION_Servicing: Block: 6 Units and under - Emergency Lighting: 0.0 JOB_TYPE_DESCRIPTION_Warden Call Equipment Repairs: 0.0 Latest Priority Description_Emergency Health and Safety: 0.0 SOR_DESCRIPTION_WINDOW FRAME:RAKE OUT AND REPOINT: 0.0 SOR_DESCRIPTION_WINDOW FITTINGS:PROVIDE NEW KEYS: 0.0 SOR_DESCRIPTION_Servicing: Block: 6 Units and under Automatic Opening Vents: 0.0 SOR_DESCRIPTION_Servicing: Block: 21-40 Units - Fire Extinguishers^/Blankets: 0.0 JOB_TYPE_DESCRIPTION_Tenant Doing Own Repair: 0.0 SOR_DESCRIPTION_WINDOW FITTING REFIX EASE ADJUST ANY TYPE(RTR IN14WORKDAYS): 0.0 SOR_DESCRIPTION_WINDOW CILL:CUT OUT AND SPLICE NEW OVER 300MM: 0.0 JOB_STATUS_DESCRIPTION_Note Job: 0.0 SOR_DESCRIPTION_WINDOW CILL:CUT OUT AND SPLICE NEW NE 300MM: 0.0 TRADE_DESCRIPTION_: 0.0 SOR_DESCRIPTION_WINDOWS^/DOORS:PVCU 2 BED FLAT CHECK CLEAN: 0.0 SOR_DESCRIPTION_TAPS:RENEW WITH SINK LEVERS - PAIR (NOT FTF): 0.0 
SOR_DESCRIPTION_WINDOWBOARD:RENEW SOFTWOOD (NOT FTF): 0.0 JOB_TYPE_DESCRIPTION_XXXXXXAsbestos Inspections: 0.0 SOR_DESCRIPTION_Servicing: Block: 7-20 Units - Automatic Opening Vents: 0.0 SOR_DESCRIPTION_WINDOW:REFIX LOOSE GLAZING BEAD: 0.0 SOR_DESCRIPTION_WINDOW:PROVIDE DRIP MOULD BEAD TO PVCU: 0.0 SOR_DESCRIPTION_Straight Stairlift ^/ Curved stairlift: 0.0 SOR_DESCRIPTION_Smoke Detector Ai 3016: 0.0 Initial Priority Description_Emergency Health and Safety: 0.0 Initial Priority Description_Health & Safety - Compliance - 4 Hours: 0.0 SOR_DESCRIPTION_WINDOW:PROVIDE CILL TO PVCU: 0.0 SOR_DESCRIPTION_Servicing: Other - Fire Extinguishers^/Blankets: 0.0 SOR_DESCRIPTION_WINDOW:OVERHAUL METAL: 0.0 SOR_DESCRIPTION_WINDOW:OVERHAUL CASEMENT: 0.0 SOR_DESCRIPTION_WINDOW:EXTRA FOR DOUBLE GLAZING IN REPAIRS (NOT FTF): 0.0 Initial Priority Description_Urgent GAS - 3 Working Days: 0.0 SOR_DESCRIPTION_WINDOW:REPOINT SILICONE TO PVCU FRAME: 0.0 SOR_DESCRIPTION_Servicing: Other - Emergency Lighting: 0.0 SOR_DESCRIPTION_TAPS:RENEW KITCHEN SINK PAIR LOW FLOW 3-4 L^/S (NOT FTF): 0.0 SOR_DESCRIPTION_WINDOW:EASE OIL BUTTS ADJUST ANY METAL(RTRWITHIN14WORK DAYS): 0.0 SOR_DESCRIPTION_Servicing: Block: 7-20 Units - Emergency Lighting: 0.0 CONTRACTOR_Contractor 13: 0.0 SOR_DESCRIPTION_TREE:CUT DOWN GIRTH UPTO 450MM: 0.0 SOR_DESCRIPTION_AMS with one Targeted area- Dwelling - Adaptation: 0.0 SOR_DESCRIPTION_CILL:RENEW OR SUPPLY AND FIX STORMGUARD CILL: 0.0 SOR_DESCRIPTION_CHIMNEY:REBUILD 1 COURSE 1 FLUE (NOT FTF): 0.0 SOR_DESCRIPTION_CHIMNEY:REBUILD 4 COURSE 2 FLUE (NOT FTF): 0.0 SOR_DESCRIPTION_CHIMNEY:REMOVE AND REFIX TV AERIAL OR DISH: 0.0 SOR_DESCRIPTION_CHIMNEY:SEAL FLUE: 0.0 SOR_DESCRIPTION_CILL:MAKE GOOD DAMAGED CONCRETE CILL: 0.0 SOR_DESCRIPTION_CILL:REBED INDIVIDUAL BRICK TO CILL: 0.0 SOR_DESCRIPTION_CILL:REFIX STORMGUARD THRESHOLD CILL: 0.0 SOR_DESCRIPTION_BASEMENT:CLEAR OUT COMPLETE: 0.0 SOR_DESCRIPTION_WALLS:WASH APPLY SEALER 2 COATS MASONRY PAINT (NOT FTF): 0.0 
SOR_DESCRIPTION_SWITCH OR OUTLET:SECURE LOOSE: 0.0 SOR_DESCRIPTION_CLIENT INSPECTION:DRAINAGE: 0.0 JOB_TYPE_DESCRIPTION_Asbestos Repairs Communal: 0.0 JOB_TYPE_DESCRIPTION_Asbestos Inspection Communal: 0.0 JOB_TYPE_DESCRIPTION_Asbestos Inspections Void: 0.0 SOR_DESCRIPTION_SWITCH OR OUTLET:REMOVE AND REFIX: 0.0 SOR_DESCRIPTION_SWITCH:RENEW 5 AMP NE 3 GANG PLATE: 0.0 SOR_DESCRIPTION_DOOR FURNITURE:SUPPLY AND FIX KICKING PLATE: 0.0 SOR_DESCRIPTION_CHIMNEY:450MM VENTED CAP TO POT (NOT FTF): 0.0 SOR_DESCRIPTION_CHECK VALVE:RENEW OR INSTALL 15MM DIAMETER: 0.0 SOR_DESCRIPTION_SWITCH:RENEW PULL SWITCH CORD: 0.0 SOR_DESCRIPTION_CEILING:BOND AND FINISH: 0.0 SOR_DESCRIPTION_CEILING:BOND AND FINISH IN PATCH: 0.0 SOR_DESCRIPTION_CEILING:FIX DOUBLE NE 12.5MM PLASTERBOARD 3MM SKIM: 0.0 SOR_DESCRIPTION_DOOR:OVERHAUL PVCU: 0.0 SOR_DESCRIPTION_CEILING:FIX DOUBLE NE 12.5MM PLASTERBOARD IN PATCH: 0.0 SOR_DESCRIPTION_CEILING:FIX NE 12.5MM PLASTERBOARD 3MM SKIM PATCH: 0.0 SOR_DESCRIPTION_CLIENT INSPECTION:FENCING: 0.0 SOR_DESCRIPTION_DOOR:OVERHAUL MULTIPOINT LOCK TO PVCU: 0.0 SOR_DESCRIPTION_CEILING:HACK RENEW PLASTER IN PATCH: 0.0 SOR_DESCRIPTION_CEILING:REMOVE COLLAPSED CEILING AFTER WATER LEAK: 0.0 SOR_DESCRIPTION_CEILING:RENEW APPLY SKIM COAT: 0.0 SOR_DESCRIPTION_CEILING:RENEW APPLY SKIM COAT IN PATCH: 0.0 SOR_DESCRIPTION_WASHING MACHINE:RENEW INDIVIDUAL VALVE: 0.0 SOR_DESCRIPTION_CEILING:RENEW NE 12.5MM PLASTERBOARD SKIM IN PATCH: 0.0 SOR_DESCRIPTION_WASHING MACHINE:FORM NEW WASTE OUTLET: 0.0 SOR_DESCRIPTION_SWITCH:RENEW CEILING PULL SWITCH: 0.0 SOR_DESCRIPTION_SURFACES:REMOVE GRAFFITI RINSE DRY: 0.0 SOR_DESCRIPTION_DOOR:OVERHAUL EXTERNAL COMPLETE: 0.0 SOR_DESCRIPTION_CLIENT INSPECTION:PROVIDE AND ERECT LADDER: 0.0 SOR_DESCRIPTION_Carry out LRA on any type of building: 0.0 SOR_DESCRIPTION_Check Gas Heating (NOT FTF): 0.0 SOR_DESCRIPTION_Commercial Engineer- Hourly Rate - In Hours: 0.0 SOR_DESCRIPTION_Commercial MAJOR SERVICE - Commercial Plant Communal Boiler: 0.0 
SOR_DESCRIPTION_WALLS:HANG WALLPAPER IN REPAIR: 0.0 SOR_DESCRIPTION_Common Area Store or Riser Fire Doors Single upgrade : 0.0 SOR_DESCRIPTION_Compartmentation (Medium): 0.0 SOR_DESCRIPTION_Carry out LRA on any tupe of building: 0.0 SOR_DESCRIPTION_WALL:REPAIR SMALL PATCH IN COMMONS: 0.0 SOR_DESCRIPTION_DOMELIGHT:REMOVE AND REFIX: 0.0 JOB_TYPE_DESCRIPTION_Asbestos Inspection Reactive: 0.0 SOR_DESCRIPTION_DOMESTIC APPLIANCE:RESPONSE AND REPAIR UPTO ?85.00: 0.0 SOR_DESCRIPTION_WALLS:APPLY 2 COATS MASONRY PAINT (NOT FTF): 0.0 SOR_DESCRIPTION_DOOR FURNITURE:REFIX ANY LOOSE FITTING: 0.0 SOR_DESCRIPTION_DOOR FURNITURE:RENEW OR INSTALL CHAIN: 0.0 SOR_DESCRIPTION_DOOR FURNITURE:RENEW SET OF LEVER HANDLES: 0.0 SOR_DESCRIPTION_Compartmentation (Small): 0.0 SOR_DESCRIPTION_CEILING HATCH:RENEW BLOCKBOARD ACCESS HATCH (NOT FTF): 0.0 SOR_DESCRIPTION_Carry out LRA on Block between 7-20 units: 0.0 SOR_DESCRIPTION_Call out cost, normal hours ? Maintenance Engineer : 0.0 SOR_DESCRIPTION_WALLS:WASH APPLY 2 COATS MASONRY PAINT (NOT FTF): 0.0 SOR_DESCRIPTION_SURFACES:PREPARE PRIME 3 COATS DECORATIVE STAIN (NOT FTF): 0.0 SOR_DESCRIPTION_CLOSER:RENEW OR SUPPLY PERKO TYPE: 0.0 SOR_DESCRIPTION_COMMUNAL WASTE CLEARANCE:CAR TYRES: 0.0 SOR_DESCRIPTION_COMMUNAL WASTE CLEARANCE:COOKERS: 0.0 SOR_DESCRIPTION_WALLS:PREPARE 2 COATS BITUMIN DAMP PROOF (NOT FTF): 0.0 SOR_DESCRIPTION_CONDENSATION DRIP TRAY:RENEW: 0.0 SOR_DESCRIPTION_Carry out LRA on Block between 21-40 units: 0.0 SOR_DESCRIPTION_DOOR:EASE OIL AND ADJUST STEEL (RTR WITHIN 7 WORKING DAYS): 0.0 SOR_DESCRIPTION_SURFACE:STRIP BACK SURFACES NE 300MM: 0.0 SOR_DESCRIPTION_CUPBOARD:RENEW 50MM BRASS LOCK: 0.0 JOB_TYPE_DESCRIPTION_Asbestos Inspections Planned: 0.0 SOR_DESCRIPTION_CURTAIN BATTEN:RENEW OR FIX NEW: 0.0 SOR_DESCRIPTION_CURTAIN TRACK:RENEW INCLUDING RUNNERS: 0.0 SOR_DESCRIPTION_DOOR:EASE ADJUST REHANG INTERNAL NEW BUTTS: 0.0 SOR_DESCRIPTION_CYLINDER:REPAIR LEAK (RTR WITHIN 3 WORKING DAYS): 0.0 SOR_DESCRIPTION_TREE:PRUNE: 0.0 
SOR_DESCRIPTION_CEILING HATCH:OVERHAUL HATCH: 0.0 SOR_DESCRIPTION_WASHING MACHINE:CLEAR BLOCKED WASTE: 0.0 SOR_DESCRIPTION_TESTING:THREE MONTHLY TEST EMERGENCY LIGHTING: 0.0 SOR_DESCRIPTION_Service Through Floor Lift: 0.0 SOR_DESCRIPTION_BATH:RENEW 1700MM STEEL WITH TAPS: 0.0 SOR_DESCRIPTION_BATH:RENEW 42MM CP WASTE OVERFLOW: 0.0 JOB_TYPE_DESCRIPTION_Domestic Lifts Inspections: 0.0 SOR_DESCRIPTION_BATH:RENEW PLUG AND CHAIN: 0.0 SOR_DESCRIPTION_BATH:TOUCH UP CHIP: 0.0 SOR_DESCRIPTION_BATH:RENEW 1700MM ACRYLIC WITH TAPS: 0.0 SOR_DESCRIPTION_BEAD:APPLY SILICONE SEAL TO GLAZING BEAD: 0.0 SOR_DESCRIPTION_BOILER:RENEW PROGRAMMER ? WIRELESS: 0.0 SOR_DESCRIPTION_BOILER:RENEW ROOM THERMOSTAT: 0.0 SOR_DESCRIPTION_THERMOSTAT:RENEW ROOM TYPE TO ELECTRIC HEATING: 0.0 SOR_DESCRIPTION_BOILER:RENEW THERMOSTAT: 0.0 SOR_DESCRIPTION_BOLT:RENEW 200MM TOWER BOLT: 0.0 SOR_DESCRIPTION_Service Stairlift : 0.0 SOR_DESCRIPTION_BOILER:RENEW PROGRAMMER - WIRELESS: 0.0 SOR_DESCRIPTION_BURST:REPAIR BURST PIPE NE 28MM (RTR WITHIN 12HRS): 0.0 JOB_TYPE_DESCRIPTION_Door Access Control Repairs & Service: 0.0 SOR_DESCRIPTION_BATH:RE-ENAMEL: 0.0 SOR_DESCRIPTION_BASIN OR SINK:CLEAR BLOCKAGE (RTR WITHIN 3 WORKING DAYS): 0.0 SOR_DESCRIPTION_BASIN:OVERHAUL CP POP-UP WASTE: 0.0 SOR_DESCRIPTION_BASIN:REFIX WASH HAND BASIN: 0.0 SOR_DESCRIPTION_BASIN:RENEW PEDESTAL ONLY: 0.0 SOR_DESCRIPTION_BASIN:RENEW PLUG AND CHAIN: 0.0 SOR_DESCRIPTION_WALL:REPAIR LARGE PATCH IN FACINGS: 0.0 SOR_DESCRIPTION_BATH:REMOVE AND REFIX (NOT FTF): 0.0 SOR_DESCRIPTION_BATH PANEL:REMOVE AND REFIX ANY TYPE: 0.0 SOR_DESCRIPTION_Servicing : Block: 6 Units and under - Fire Detection System: 0.0 SOR_DESCRIPTION_BATH PANEL:RENEW HARDBOARD END EXISTING FRAMING (NOT FTF): 0.0 SOR_DESCRIPTION_DOOR:RENEW 1^/2HR FLUSH COMMUNAL DOOR (NOT FTF): 0.0 SOR_DESCRIPTION_WASTE CHUTE:INSPECT TAKE DOWN REPAIR: 0.0 SOR_DESCRIPTION_BATH PANEL:RENEW HARDBOARD SIDE EXISTING FRAMING (NOT FTF): 0.0 SOR_DESCRIPTION_BATH:CLEAR BLOCKAGE TO WASTE (RTR WITHIN 3 
WORKING DAYS): 0.0 SOR_DESCRIPTION_BATH PANEL:RENEW ACRYLIC SIDE (NOT FTF): 0.0 SOR_DESCRIPTION_BURST:REPAIR LEAKING FITTING NE 28MM (RTR WITHIN 12HRS): 0.0 SOR_DESCRIPTION_DOOR:CUT OPENINGS^/SUPPLY AND FIX 225X75MM VENTS: 0.0 SOR_DESCRIPTION_Self Closers - Double Door Sets: 0.0 SOR_DESCRIPTION_CARBON MONOXIDE DETECTOR:INSTALL MAINS OPERATED: 0.0 SOR_DESCRIPTION_DOOR:PATCH REPAIR INTERNAL: 0.0 SOR_DESCRIPTION_DOOR:PATCH OR REPAIR DOOR STILE: 0.0 SOR_DESCRIPTION_Self Closers - Single Door Sets: 0.0 SOR_DESCRIPTION_CCTV:SERVICE ANY TYPE: 0.0 JOB_TYPE_DESCRIPTION_Communal Area Building Safety Inspection: 0.0 JOB_TYPE_DESCRIPTION_Commercial Lifts Inspections: 0.0 SOR_DESCRIPTION_CCU:RENEW HRC FUSE: 0.0 SOR_DESCRIPTION_Board up window^/Door: 0.0 SOR_DESCRIPTION_CEILING HATCH:FORM OPENING COMPLETE (NOT FTF): 0.0 SOR_DESCRIPTION_CABLE:FIX 6.0MM T AND E AND MINI-TRUNKING: 0.0 JOB_TYPE_DESCRIPTION_Communal Gas Inspections: 0.0 SOR_DESCRIPTION_DOOR:PROVIDE HARDWOOD RAIN DEFLECTOR: 0.0 SOR_DESCRIPTION_CARPET:RENEW TO COMMUNAL AREAS (NOT FTF): 0.0 SOR_DESCRIPTION_TRADE JOB PLUMBING POUND (NOT FTF): 0.0 SOR_DESCRIPTION_Bulk or Area sample(s) with Analysis: 0.0 ABANDON_REASON_DESC_Tenant Missed Appt: 5.533260118829845e-23 JOB_TYPE_DESCRIPTION_Water Hygiene Inspections: 1.1259531910668151e-19 SOR_DESCRIPTION_Firestopping and Hole Filling area up to 100 sq m.: 1.126190790290246e-19 SOR_DESCRIPTION_MANHOLE:REBED COVER AND OR FRAME: 1.3516627463604104e-19 SOR_DESCRIPTION_: 1.3599363743048625e-19 CONTRACTOR_Contractor 12: 1.7773985174713008e-19 CONTRACTOR_Contractor 22: 2.6066583556027595e-19 SOR_DESCRIPTION_HANDRAIL:REFIX ANY LOOSE TYPE: 2.798302023061127e-19 JOB_TYPE_DESCRIPTION_Pre-Inspection: 3.07391253456585e-19 SOR_DESCRIPTION_WALL:DEMOLISH PLASTERED 1^/2B WALL: 3.417348048402014e-19 SOR_DESCRIPTION_STEP:REPAIR DAMAGED CONCRETE: 3.746151108808494e-19 CONTRACTOR_Contractor 11: 4.3087964429292905e-19 JOB_STATUS_DESCRIPTION_Pre-Inspection: 4.378464859511696e-19 
[Cell output condensed: the full list of feature-importance scores, one per one-hot-encoded category level, sorted ascending from roughly 5e-19. Almost all levels contribute negligibly; the dominant features are Jobsourcedescription_Total Mobile App (0.386), Initial Priority Description_Two Week Void (0.112), Initial Priority Description_ (blank level, 0.071), Jobsourcedescription_OneMobile app (0.065), SOR_DESCRIPTION_TRADE JOB FLAGGING/CONCRETE EXT WORKS POUND (NOT FTF) (0.048) and JOB_STATUS_DESCRIPTION_Invoice Accepted (0.036). Every remaining level contributes less than 0.02 individually.]
# Define predictor variables and response variable
predictors = ['Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'Latest Priority Description', 'Mgt Area', 'CONTRACTOR']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy[predictors]
y = int_df_copy[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
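The two-stage split above yields 70/15/15 train/test/validation proportions: 30% is held out first, then that hold-out is split in half. A quick check on dummy data:

```python
# Verify the 70/15/15 proportions produced by the two-stage split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_test), len(X_val))  # 700 150 150
```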
# Create a Pipeline with Random Forest model; the pipeline applies the
# one-hot encoding itself, so the splits need no separate transformation
pipeline = Pipeline([
    ('transformer', transformer),
    ('model', RandomForestRegressor(random_state=42))
])
# Parameter grid for Grid Search (a single candidate per parameter here)
param_grid = {
    'model__n_estimators': [300],
    'model__max_depth': [20],
    'model__min_samples_leaf': [1],
    'model__max_features': [1.0]  # 'auto' was removed in scikit-learn 1.3; 1.0 is its equivalent for regressors
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=kfold,
                           scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline (fitted transformer + model) and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets;
# the fitted pipeline encodes the raw features before predicting
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE for each set
train_mse = mean_squared_error(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
val_mse = mean_squared_error(y_val, y_val_preds)
# Print MSE results
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Testing MSE: {test_mse}")
# Additional evaluation metrics
# Calculate RMSE for the training/testing/validation set
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_preds))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_preds))
# Print the RMSE values
print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")
print(f"Validation RMSE: {val_rmse}")
# Calculate MAE for training/testing/validation data
train_mae = mean_absolute_error(y_train, y_train_preds)
val_mae = mean_absolute_error(y_val, y_val_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
# Calculate R² for training/testing/validation data
train_r2 = r2_score(y_train, y_train_preds)
val_r2 = r2_score(y_val, y_val_preds)
test_r2 = r2_score(y_test, y_test_preds)
# MAPE is intentionally not computed here: with 7,624 zero or near-zero response values in the dataset,
# the percentage error divides by (near-)zero, which inflates MAPE and makes it unreliable in this scenario.
# train_mape = mean_absolute_percentage_error(y_train, y_train_preds)
# test_mape = mean_absolute_percentage_error(y_test, y_test_preds)
# val_mape = mean_absolute_percentage_error(y_val, y_val_preds)
# Print MAE results
print(f"Training MAE: {train_mae}")
print(f"Validation MAE: {val_mae}")
print(f"Testing MAE: {test_mae}")
# Print R² results
print(f"Training R²: {train_r2}")
print(f"Validation R²: {val_r2}")
print(f"Testing R²: {test_r2}")
# Print MAPE results (also skipped; see the note above)
# print(f"Training MAPE: {train_mape}")
# print(f"Validation MAPE: {val_mape}")
# print(f"Testing MAPE: {test_mape}")
# Residuals vs Predicted Plot
residuals = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Feature Importance Plot
# Fitting the OneHotEncoder separately to get the feature names
one_hot.fit(X_train[categorical_features])
feature_names = one_hot.get_feature_names_out(input_features=categorical_features)
feature_importances = best_model.feature_importances_
sorted_idx = feature_importances.argsort()
# Print sorted feature importances
print("Sorted Feature Importances:")
for idx in sorted_idx:
print(f"{feature_names[idx]}: {feature_importances[idx]}")
plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance', fontweight="bold", fontsize = 16)
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, X_test, y_test, step=50, max_data_points=1000):
train_errors, val_errors, test_errors = [], [], []
# Use shape[0] to get the number of samples in the training set
n_train_samples = min(max_data_points, X_train.shape[0])
for m in range(1, n_train_samples, step):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
y_test_predict = model.predict(X_test)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
test_mse = mean_squared_error(y_test, y_test_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
test_errors.append(test_mse)
# Plot RMSE against the actual training-set sizes used in the loop
sizes = list(range(1, n_train_samples, step))
plt.plot(sizes, np.sqrt(train_errors), label="Train")
plt.plot(sizes, np.sqrt(val_errors), label="Validation")
plt.plot(sizes, np.sqrt(test_errors), label="Test")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.legend()
plt.title("Training, Validation, and Test Loss Curves", fontweight="bold", fontsize = 16)
plt.show()
# Note: plot_learning_curves refits the supplied model on growing training subsets, so best_model is re-fitted inside the call
plot_learning_curves(best_model, X_train_transformed, y_train, X_val_transformed, y_val, X_test_transformed, y_test, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 21.0s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 22.3s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 19.6s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 19.2s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 18.7s
Best Parameters: {'model__max_depth': 20, 'model__max_features': 'auto', 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Training MSE: 68094.18994148681
Validation MSE: 98479.57777529076
Testing MSE: 102937.64618915945
Training RMSE: 260.94863468025045
Testing RMSE: 320.83897236644964
Validation RMSE: 313.81455953363724
Training MAE: 86.90818905824285
Validation MAE: 95.43088259871578
Testing MAE: 95.06835148301428
Training R²: 0.8469741687787155
Validation R²: 0.6967492518238843
Testing R²: 0.7164797666395144
Sorted Feature Importances (ascending; the remaining ~110 one-hot features, with importances below 0.001 and many exactly zero, are omitted here for readability):
TRADE_DESCRIPTION_Pound Jobs No SOR: 0.0010344905949274963
Latest Priority Description_28 Calendar Days - Compliance: 0.0010608786679499166
Initial Priority Description_28 Calendar Days - Compliance: 0.0012612175485561083
Latest Priority Description_: 0.001381883400240266
Jobsourcedescription_Contractor Report: 0.0014073297895118577
TRADE_DESCRIPTION_Electrician: 0.0015913017565076537
JOB_STATUS_DESCRIPTION_Work Completed: 0.001842571922018501
TRADE_DESCRIPTION_Scaffold: 0.0019684945682783925
Jobsourcedescription_Asset Officer: 0.0020234812681157372
Jobsourcedescription_CSC Phone Call: 0.0020906568966883605
TRADE_DESCRIPTION_Groundwork: 0.0024788782439768985
TRADE_DESCRIPTION_Painting and Decorating: 0.0024806413999263043
Latest Priority Description_Two Week Void: 0.0025581475329050757
TRADE_DESCRIPTION_Miscellaneous Works: 0.0028048623384959994
TRADE_DESCRIPTION_Brickwork/Blockwork: 0.002982775601556237
TRADE_DESCRIPTION_Roofing: 0.0036000964166211877
TRADE_DESCRIPTION_Fencing: 0.0037516640205230095
Initial Priority Description_Appointable: 0.004094700921138515
Property Type_Access via internal shared area: 0.004755610227683422
Property Type_Detached: 0.005946522677665455
Property Type_Semi Detached: 0.006033120309752705
TRADE_DESCRIPTION_Floor Wall Ceilings: 0.006412438162157109
Latest Priority Description_Section 11 Works: 0.006624249057050016
Property Type_End Terrace: 0.006816813147548956
TRADE_DESCRIPTION_Carpenter: 0.00784505179620726
Initial Priority Description_Section 11 Works: 0.008023157723423208
TRADE_DESCRIPTION_Gas Repairs: 0.008156583767419978
Property Type_Terrace: 0.010045573961134643
TRADE_DESCRIPTION_Void Repairs: 0.011807556599154374
Property Type_Access direct: 0.014165131181853089
JOB_STATUS_DESCRIPTION_Job Logged: 0.0228875226620275
Initial Priority Description_Damp and Mould Follow-On Work: 0.025981562067299666
Latest Priority Description_Damp and Mould Follow-On Work: 0.02814595996643526
JOB_STATUS_DESCRIPTION_Invoice Accepted: 0.03198102176918733
JOB_STATUS_DESCRIPTION_Abandoned: 0.03350588050854862
TRADE_DESCRIPTION_Multi Trade: 0.036181547681953675
Jobsourcedescription_OneMobile app: 0.0687705524264699
Initial Priority Description_: 0.08077762049911612
Initial Priority Description_Two Week Void: 0.11448970386431812
Jobsourcedescription_Total Mobile App: 0.4093851285846535
# Define predictor variables and response variable
predictors = ['Property Type','ABANDON_REASON_DESC','Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy[predictors]
y = int_df_copy[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Transform the datasets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_val_transformed = transformer.transform(X_val)
# Create a Pipeline with Random Forest model
pipeline = Pipeline([
('transformer', transformer),
('model', RandomForestRegressor(random_state=42))
])
# Parameter grid for GridSearchCV (a single candidate per parameter, so no real search is performed)
param_grid = {
'model__n_estimators': [300],
'model__max_depth': [20],
'model__min_samples_leaf': [1],
'model__max_features': ['auto']  # note: 'auto' is deprecated in newer scikit-learn; 1.0 is the regressor equivalent
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=kfold,
scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best model and parameters
best_model = grid_search.best_estimator_.named_steps['model']
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
# (best_model is the bare regressor, so the pre-transformed matrices are used; the standalone
# transformer was fitted on the same X_train as the pipeline's, so the encodings match)
y_train_preds = best_model.predict(X_train_transformed)
y_test_preds = best_model.predict(X_test_transformed)
y_val_preds = best_model.predict(X_val_transformed)
# Calculate MSE for each set
train_mse = mean_squared_error(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
val_mse = mean_squared_error(y_val, y_val_preds)
# Print MSE results
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Testing MSE: {test_mse}")
# Additional evaluation metrics
# Calculate RMSE for the training/testing/validation set
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_preds))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_preds))
# Print the RMSE values
print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")
print(f"Validation RMSE: {val_rmse}")
# Calculate MAE for training/testing/validation data
train_mae = mean_absolute_error(y_train, y_train_preds)
val_mae = mean_absolute_error(y_val, y_val_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
# Calculate R² for training/testing/validation data
train_r2 = r2_score(y_train, y_train_preds)
val_r2 = r2_score(y_val, y_val_preds)
test_r2 = r2_score(y_test, y_test_preds)
# MAPE is intentionally not computed here: with 7,624 zero or near-zero response values in the dataset,
# the percentage error divides by (near-)zero, which inflates MAPE and makes it unreliable in this scenario.
# train_mape = mean_absolute_percentage_error(y_train, y_train_preds)
# test_mape = mean_absolute_percentage_error(y_test, y_test_preds)
# val_mape = mean_absolute_percentage_error(y_val, y_val_preds)
# Print MAE results
print(f"Training MAE: {train_mae}")
print(f"Validation MAE: {val_mae}")
print(f"Testing MAE: {test_mae}")
# Print R² results
print(f"Training R²: {train_r2}")
print(f"Validation R²: {val_r2}")
print(f"Testing R²: {test_r2}")
# Print MAPE results (also skipped; see the note above)
# print(f"Training MAPE: {train_mape}")
# print(f"Validation MAPE: {val_mape}")
# print(f"Testing MAPE: {test_mape}")
# Residuals vs Predicted Plot
residuals = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Feature Importance Plot
# Fitting the OneHotEncoder separately to get the feature names
one_hot.fit(X_train[categorical_features])
feature_names = one_hot.get_feature_names_out(input_features=categorical_features)
feature_importances = best_model.feature_importances_
sorted_idx = feature_importances.argsort()
# Print sorted feature importances
print("Sorted Feature Importances:")
for idx in sorted_idx:
print(f"{feature_names[idx]}: {feature_importances[idx]}")
plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance', fontweight="bold", fontsize = 16)
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, X_test, y_test, step=50, max_data_points=1000):
train_errors, val_errors, test_errors = [], [], []
# Use shape[0] to get the number of samples in the training set
n_train_samples = min(max_data_points, X_train.shape[0])
for m in range(1, n_train_samples, step):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
y_test_predict = model.predict(X_test)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
test_mse = mean_squared_error(y_test, y_test_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
test_errors.append(test_mse)
# Plot RMSE against the actual training-set sizes used in the loop
sizes = list(range(1, n_train_samples, step))
plt.plot(sizes, np.sqrt(train_errors), label="Train")
plt.plot(sizes, np.sqrt(val_errors), label="Validation")
plt.plot(sizes, np.sqrt(test_errors), label="Test")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.legend()
plt.title("Training, Validation, and Test Loss Curves", fontweight="bold", fontsize = 16)
plt.show()
# Note: plot_learning_curves refits the supplied model on growing training subsets, so best_model is re-fitted inside the call
plot_learning_curves(best_model, X_train_transformed, y_train, X_val_transformed, y_val, X_test_transformed, y_test, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 1.8s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 3.1s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 2.2s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 1.4s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 1.4s
Best Parameters: {'model__max_depth': 20, 'model__max_features': 'auto', 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Training MSE: 437377.9264733103
Validation MSE: 319438.8553918066
Testing MSE: 356705.32011998084
Training RMSE: 661.3455424158466
Testing RMSE: 597.2481227429524
Validation RMSE: 565.1892208736881
Training MAE: 175.07716535627378
Validation MAE: 165.5274255892406
Testing MAE: 165.91514613065357
Training R²: 0.017094985432192344
Validation R²: 0.01634354977512109
Testing R²: 0.01752974402095664
Sorted Feature Importances:
ABANDON_REASON_DESC_Input Error: 0.0
ABANDON_REASON_DESC_Duplicate Order: 0.0
ABANDON_REASON_DESC_Data Clean Up: 0.0
ABANDON_REASON_DESC_Alternative Job: 0.0
ABANDON_REASON_DESC_Added to Planned Programme: 0.0
ABANDON_REASON_DESC_Abortive Call: 0.0
ABANDON_REASON_DESC_No Charge: 0.0
ABANDON_REASON_DESC_No Access: 0.0
ABANDON_REASON_DESC_No Work Required: 0.0
ABANDON_REASON_DESC_See Repair Memo: 0.0
ABANDON_REASON_DESC_Tenant Missed Appt: 0.0
ABANDON_REASON_DESC_Tenant Refusal: 0.0
ABANDON_REASON_DESC_Testing: 0.0
ABANDON_REASON_DESC_Work Under Guarantee: 0.0
ABANDON_REASON_DESC_Wrong Contractor: 0.0
ABANDON_REASON_DESC_Riverside Not Approved: 0.0
ABANDON_REASON_DESC_Inspection Not Required: 0.0
Mgt Area_MA3: 0.00113545337916886
Property Type_0: 0.0012154739352853463
Property Type_Block No Shared Area: 0.003387141542098907
Property Type_End Terrace: 0.0036293039999416342
Property Type_Default: 0.00406396066775516
Property Type_Other Non-Rentable Space: 0.0052772907810005616
Property Type_Access via internal shared area: 0.006184444223272246
Property Type_Semi Detached: 0.009491845090013844
Mgt Area_MA2: 0.009948150111847633
Mgt Area_MA1: 0.01086664748430729
Property Type_Terrace: 0.012359186298313158
Property Type_Detached: 0.029649397849063956
Property Type_Access direct: 0.04564217056348101
ABANDON_REASON_DESC_nan: 0.8571495340744504
# Define predictor variables and response variable
predictors = ['Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy[predictors]
y = int_df_copy[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Transform the datasets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_val_transformed = transformer.transform(X_val)
# Create a Pipeline with Random Forest model
pipeline = Pipeline([
('transformer', transformer),
('model', RandomForestRegressor(random_state=42))
])
# Parameter grid for GridSearchCV (a single candidate per parameter, so no real search is performed)
param_grid = {
'model__n_estimators': [300],
'model__max_depth': [20],
'model__min_samples_leaf': [1],
'model__max_features': ['auto']  # note: 'auto' is deprecated in newer scikit-learn; 1.0 is the regressor equivalent
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=kfold,
scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best model and parameters
best_model = grid_search.best_estimator_.named_steps['model']
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
# (best_model is the bare regressor, so the pre-transformed matrices are used; the standalone
# transformer was fitted on the same X_train as the pipeline's, so the encodings match)
y_train_preds = best_model.predict(X_train_transformed)
y_test_preds = best_model.predict(X_test_transformed)
y_val_preds = best_model.predict(X_val_transformed)
# Calculate MSE for each set
train_mse = mean_squared_error(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
val_mse = mean_squared_error(y_val, y_val_preds)
# Print MSE results
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Testing MSE: {test_mse}")
# Additional evaluation metrics
# Calculate RMSE for the training/testing/validation set
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_preds))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_preds))
# Print the RMSE values
print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")
print(f"Validation RMSE: {val_rmse}")
# Calculate MAE for training/testing/validation data
train_mae = mean_absolute_error(y_train, y_train_preds)
val_mae = mean_absolute_error(y_val, y_val_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
# Calculate R² for training/testing/validation data
train_r2 = r2_score(y_train, y_train_preds)
val_r2 = r2_score(y_val, y_val_preds)
test_r2 = r2_score(y_test, y_test_preds)
# MAPE is intentionally not computed here: with 7,624 zero or near-zero response values in the dataset,
# the percentage error divides by (near-)zero, which inflates MAPE and makes it unreliable in this scenario.
# train_mape = mean_absolute_percentage_error(y_train, y_train_preds)
# test_mape = mean_absolute_percentage_error(y_test, y_test_preds)
# val_mape = mean_absolute_percentage_error(y_val, y_val_preds)
# Print MAE results
print(f"Training MAE: {train_mae}")
print(f"Validation MAE: {val_mae}")
print(f"Testing MAE: {test_mae}")
# Print R² results
print(f"Training R²: {train_r2}")
print(f"Validation R²: {val_r2}")
print(f"Testing R²: {test_r2}")
# Print MAPE results (also skipped; see the note above)
# print(f"Training MAPE: {train_mape}")
# print(f"Validation MAPE: {val_mape}")
# print(f"Testing MAPE: {test_mape}")
# Residuals vs Predicted Plot
residuals = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Feature Importance Plot
# Fitting the OneHotEncoder separately to get the feature names
one_hot.fit(X_train[categorical_features])
feature_names = one_hot.get_feature_names_out(input_features=categorical_features)
feature_importances = best_model.feature_importances_
sorted_idx = feature_importances.argsort()
# Print sorted feature importances
print("Sorted Feature Importances:")
for idx in sorted_idx:
print(f"{feature_names[idx]}: {feature_importances[idx]}")
plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance', fontweight="bold", fontsize = 16)
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, X_test, y_test, step=50, max_data_points=1000):
train_errors, val_errors, test_errors = [], [], []
# Use shape[0] to get the number of samples in the training set
n_train_samples = min(max_data_points, X_train.shape[0])
for m in range(1, n_train_samples, step):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
y_test_predict = model.predict(X_test)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
test_mse = mean_squared_error(y_test, y_test_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
test_errors.append(test_mse)
# Plot RMSE against the actual training-set sizes used in the loop
sizes = list(range(1, n_train_samples, step))
plt.plot(sizes, np.sqrt(train_errors), label="Train")
plt.plot(sizes, np.sqrt(val_errors), label="Validation")
plt.plot(sizes, np.sqrt(test_errors), label="Test")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.legend()
plt.title("Training, Validation, and Test Loss Curves", fontweight="bold", fontsize = 16)
plt.show()
# Note: plot_learning_curves refits the supplied model on growing training subsets, so best_model is re-fitted inside the call
plot_learning_curves(best_model, X_train_transformed, y_train, X_val_transformed, y_val, X_test_transformed, y_test, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 0.3s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 0.3s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 0.3s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 0.3s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 0.3s
Best Parameters: {'model__max_depth': 20, 'model__max_features': 'auto', 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Training MSE: 444973.09402960807
Validation MSE: 324897.33994944516
Testing MSE: 363165.52632497624
Training RMSE: 667.0630360240388
Testing RMSE: 602.6321650268729
Validation RMSE: 569.9976666175443
Training MAE: 190.35710870885674
Validation MAE: 177.2618657403592
Testing MAE: 180.60235062763655
Training R²: 2.6615435190557868e-05
Validation R²: -0.00046490496651774293
Testing R²: -0.00026354384412008436
Sorted Feature Importances:
Mgt Area_MA1: 0.22253107387404428
Mgt Area_MA2: 0.3662042941392214
Mgt Area_MA3: 0.41126463198673435
Quick Notes on Diagnostic Metrics:
MSE:
MSE measures how close a regression line is to a set of points. It takes the distances from the points to the regression line (these distances are the "errors") and squares them. Squaring removes negative signs and gives more weight to larger differences. A small MSE suggests a tight fit of the model to the data.
Formula: MSE = (1/n) * Σ (yᵢ − ŷᵢ)², where yᵢ is the actual value and ŷᵢ the predicted value of the i-th observation.
RMSE:
Formula: RMSE = √MSE
MAE:
The Mean Absolute Error (MAE) evaluates how close predictions are to the outcomes. Unlike MSE, MAE measures the average magnitude of the errors in a set of predictions without considering their direction.
Formula: MAE = (1/n) * Σ |yᵢ − ŷᵢ|
R²:
The R² score measures how well the variability in the response variable is explained by the model.
Formula: R² = 1 − (Sum of Squares of Residuals (SSres) / Total Sum of Squares (SStot))
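These formulas can be verified numerically. The following is a minimal NumPy sketch on toy values (not the notebook's data), computing each metric exactly as defined above:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # toy actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # toy predictions

errors = y_true - y_pred
mse = np.mean(errors ** 2)                       # (1/n) * sum((y_i - y_hat_i)^2)
rmse = np.sqrt(mse)                              # RMSE = sqrt(MSE)
mae = np.mean(np.abs(errors))                    # (1/n) * sum(|y_i - y_hat_i|)
ss_res = np.sum(errors ** 2)                     # sum of squares of residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # R^2 = 1 - SSres/SStot

print(f"MSE={mse}, RMSE={rmse}, MAE={mae}, R2={r2}")
```

These hand-rolled values should match `sklearn.metrics.mean_squared_error`, `mean_absolute_error`, and `r2_score` on the same arrays.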
Comparing and contrasting measures like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and R-squared for a model such as Random Forest is important because each metric provides unique insights into the model's performance; collectively, they offer a comprehensive picture of its strengths and weaknesses.
Model Performance Comparison (Random Forest):
Model-1: 11 Predictors:
Overfitting: This model shows signs of overfitting, as indicated by the high training R² (0.882) compared to the lower testing R² (0.541). It performs well on training data but less so on unseen data.
Generalisability: Moderate generalizability due to its reasonable performance on validation and testing sets, but there's a notable drop in performance compared to the training set.
Feature Influence: With 11 predictors, this model likely captures more complex relationships in the data. However, the large number of features might contribute to overfitting.
#############################################################################################################
Model-2: Consensus by at Least 2 Feature Selection Methods:
Overfitting: Slightly better balanced than Model-1, with less discrepancy between training and testing metrics. However, it still shows signs of overfitting.
Generalisability: Comparable to Model-1, with a slight decrease in testing R² (0.531). The reduced number of features seems to have a limited impact on the model's ability to generalize.
Feature Influence: The selection of features based on consensus by at least two methods suggests a more focused approach, potentially including the most relevant predictors.
#############################################################################################################
Model-3: Consensus by at Least 3 Feature Selection Methods
Overfitting: Significantly reduced compared to Model-1 and Model-2, as indicated by the closer performance metrics between training and testing. However, the overall performance is much lower.
Generalisability: Poor. The model has low R² values across all datasets, indicating it does not predict well.
Feature Influence: The limited number of predictors (only three) may not capture enough complexity in the data, leading to underfitting.
#############################################################################################################
Model-4: Consensus by All 4 Feature Selection Methods
Overfitting: Virtually none, indicated by similar performance across training, validation, and testing. However, the model's overall performance is very poor.
Generalisability: Very poor, with negative R² values in validation and testing, suggesting that the model is worse than predicting a simple average.
Feature Influence: Relying on a single predictor ('Mgt Area') is overly simplistic, leading to a model that does not capture the information needed to make accurate predictions.
#############################################################################################################
Quick Summary:
Model-1 demonstrates the highest potential for accurate predictions but needs adjustments to reduce overfitting.
Model-2 balances complexity and overfitting better than Model-1, with a slight compromise in performance.
Model-3 and Model-4 are overly simplified, leading to underfitting and poor predictive capability.
#################################################################################
Model-1 (11 predictors)
Overfitting: More pronounced, as indicated by a significant drop in R² from training (0.900) to testing (0.747). This suggests that the model fits the training data well but struggles to maintain the same level of performance on unseen data.
Generalizability: While the model generalizes reasonably, the gap in performance metrics between training and testing datasets hints at its limitations in dealing with unseen data.
Model-2 (Ensemble mix - 2+ feature selection techniques)
Overfitting: Less pronounced compared to Model-1. The decrease in R² from training (0.847) to testing (0.716) is smaller, suggesting a more balanced model that doesn't excessively tailor itself to the training data.
Generalizability: Despite slightly lower R² values in testing compared to Model-1, the reduced overfitting implies a better ability to adapt to unseen data. The smaller gap between training and testing performance metrics is indicative of a more robust model.
Model-3 (Ensemble mix - 3+ feature selection techniques)
Overfitting: Minimal, indicated by low R² values across the training, validation, and testing datasets.
Generalizability: The model generalizes poorly, with very low R² values (around 0.017), suggesting it is not capturing the patterns in the data effectively.
Key Features: The limited number of predictors (3) significantly reduces the model's ability to capture the complex relationships in the data.
Model-4 (Ensemble mix - all 4 feature selection techniques)
Overfitting: Almost none, but like Model-3, performance is poor across all datasets.
Generalizability: The lowest among all models, as indicated by negative R² values on the validation and testing datasets.
Key Features: Relying on a single predictor ('Mgt Area') severely limits the model's predictive power.
###########################################################################################
Model Comparative Analysis (first 2 best models of ensembled features execution)
Model-1 vs Model-2: Model-1 might capture more complexities due to its larger number of predictors, but this also makes it more prone to overfitting. Model-2, with fewer predictors, strikes a better balance between fitting the training data and maintaining performance on new data.
Feature Selection Impact: The feature selection in Model-2 seems effective in retaining significant predictors while avoiding the over-complexity seen in Model-1. This results in a model that is slightly less accurate but more reliable when applied to unseen data.
############################################################################################
General Notes:
For scenarios where the balance between accuracy and generalizability is crucial, Model-2 appears to be a better choice. It offers a good mix of predictive power and adaptability to new data, making it potentially more useful in real-world applications where overfitting can be a significant concern.
In summary, while Model-1 has a slight edge in accuracy, Model-2's reduced overfitting and closer performance metrics between training and testing datasets make it more suitable for scenarios where generalizability is key.
Key Business Metrics Drivers behind the Random Forest Model's ability to predict Total Repair Costs:
Model-1 (11 Predictors):
Key Features: This model contains a comprehensive set of features, including both "Property Type" and "Mgt Area".
Analysis: The model shows a high degree of overfitting. This could be due to the inclusion of a wide range of features, which might capture more variance in the training data but adversely affect the model's ability to generalize to unseen data.
####################################################################################################################
Model-2 (Consensus by at least 2 Feature Selection Techniques):
Key Features: This model also includes "Property Type" and "Mgt Area", along with other features selected by at least two feature selection techniques.
Analysis: The performance metrics of Model-2 suggest better generalizability than Model-1, with less overfitting. This indicates that the selected features are potentially more relevant and impactful in predicting repair costs.
#########################################################################################################################
Model-3 (Consensus by at least 3 Feature Selection Techniques):
Key Features: "Property Type" and "Mgt Area" are present, along with "ABANDON_REASON_DESC".
Analysis: The significant drop in performance metrics indicates that these three features alone are not sufficient to capture the complexity of repair costs, leading to underfitting.
#################################################################################################################
Model-4 (Consensus by all 4 Feature Selection Techniques):
Key Features: Only "Mgt Area" is used.
Analysis: Relying on a single feature results in the poorest performance, suggesting that 'Mgt Area' alone is not a strong predictor of repair costs.
#########################################################################################################################
Conclusion:
The inclusion of "Property Type" and "Mgt Area" in all models highlights their perceived importance. However, Model-4 demonstrates that relying solely on "Mgt Area" is inadequate. The balanced approach in Model-2, which includes these two features along with others chosen through consensus, seems most effective. This model strikes a good balance between capturing enough complexity to perform well on training data and maintaining the ability to generalize to new data.
int_df_copy = Int_df_merged.copy()
# Explicitly specify the format for 'Date Comp' and 'Date Logged'
int_df_copy['Date Comp'] = pd.to_datetime(int_df_copy['Date Comp'], format='%d/%m/%Y')
int_df_copy['Date Logged'] = pd.to_datetime(int_df_copy['Date Logged'], format='%d/%m/%Y')
# Calculate 'Task_completion_time' in days
int_df_copy['Task_completion_time'] = (int_df_copy['Date Comp'] - int_df_copy['Date Logged']).dt.days
# Check for null or NaN values in 'Task_completion_time'
null_or_nan = int_df_copy['Task_completion_time'].isnull().sum()
# Display the count of null or NaN values
print(f"Number of null or NaN values in 'Task_completion_time': {null_or_nan}")
# Count the number of negative, positive, and zero values
num_negative = (int_df_copy['Task_completion_time'] < 0).sum()
num_positive = (int_df_copy['Task_completion_time'] > 0).sum()
num_zero = (int_df_copy['Task_completion_time'] == 0).sum()
# Print the counts
print(f"Number of negative values in 'Task_completion_time': {num_negative}")
print(f"Number of positive values in 'Task_completion_time': {num_positive}")
print(f"Number of zero values in 'Task_completion_time': {num_zero}")
# Display the DataFrame to verify the results
# int_df_copy.head()
Number of null or NaN values in 'Task_completion_time': 790
Number of negative values in 'Task_completion_time': 12
Number of positive values in 'Task_completion_time': 13505
Number of zero values in 'Task_completion_time': 6979
Issues with Data Imputation - Missing at Random (MAR) or Missing Not at Random (MNAR)?
1- 790 of the 21285 records have NaN values for "Date Comp".
Analysing the Distribution of Missing Data
1- Proportion of missing 'Task_completion_time'
Missing Proportion of data
# Create a missing indicator for 'Task_completion_time'
int_df_copy['missing_task_completion_time'] = int_df_copy['Task_completion_time'].isna()
# Group by 'JOB_TYPE_DESCRIPTION' and calculate the proportion of missing 'Task_completion_time'
missing_proportion = int_df_copy.groupby('JOB_TYPE_DESCRIPTION')['missing_task_completion_time'].mean().sort_values(ascending=False)
# Create a DataFrame sorted by the proportion of missing data
missing_proportion_df = missing_proportion.sort_values(ascending=False).reset_index()
missing_proportion_df.columns = ['JOB_TYPE_DESCRIPTION', 'Proportion_Missing']
# Display the DataFrame
print(missing_proportion_df)
# Plotting the proportion of missing 'Task_completion_time' by 'JOB_TYPE_DESCRIPTION'
plt.figure(figsize=(12, 10))
sns.barplot(x=missing_proportion.values, y=missing_proportion.index)
plt.xlabel('Proportion Missing Task Completion Time')
plt.title('Proportion of Missing Task Completion Time by JOB TYPE DESCRIPTION', fontweight="bold", fontsize=16)
plt.tight_layout() # Adjust the plot to ensure the labels fit well
plt.show()
    JOB_TYPE_DESCRIPTION                       Proportion_Missing
0   Tenant Doing Own Repair                    1.000000
1   Play Equipment Inspections                 1.000000
2   Pre-Inspection                             0.757576
3   Play Equipment Repairs                     0.750000
4   Asbestos Repairs Communal                  0.333333
5   Water Risk Inspection                      0.300000
6   Gate and Barrier Repairs                   0.250000
7   Suspected Damp                             0.179521
8   Commercial Lifts Inspections               0.145299
9   Asbestos Repairs Void                      0.142857
10  Water Hygiene Inspections                  0.111111
11  Schedule Repairs Visit                     0.111111
12  Aids and Adaptations                       0.111111
13  Communal Area Building Safety Inspection   0.105960
14  Door Inspection and Repairs                0.105263
15  Asbestos Inspection Reactive               0.101852
16  Door Access Control Repairs & Service      0.101695
17  PAT Testing                                0.095238
18  Warden Call Equipment Repairs              0.086957
19  Asbestos Inspections Planned               0.076923
20  Domestic Lifts Repairs                     0.057143
21  Communal Responsive Repairs                0.054348
22  Section 11 Repairs                         0.050000
23  Fire Risk Repairs Planned                  0.048780
24  Domestic Lifts Inspections                 0.048780
25  Lifts Consultants                          0.043478
26  Void Repairs                               0.043384
27  Asbestos Inspections Void                  0.041667
28  Responsive Repairs                         0.027838
29  Gas Responsive Repairs                     0.021058
30  Communal Gas Repairs                       0.018634
31  Fire Safety Equipment Inspections          0.016129
32  Fire Safety Equipment Repairs              0.010204
33  Asbestos Inspection Communal               0.000000
34  Asbestos Repairs Planned                   0.000000
35  Asbestos Repairs Reactive                  0.000000
36  Communal Gas Inspections                   0.000000
37  Fire Risk Repairs                          0.000000
38  Water Hygiene Repairs                      0.000000
39  Lightning Conductors and Fall Safety Rep   0.000000
40  Rechargeable Repairs                       0.000000
41  Gas Exclusion                              0.000000
42  XXXXXAsbestos Repairs                      0.000000
43  XXXXXXAsbestos Inspections                 0.000000
Data Imputation Impact -- Implications of Removing Missing Data:
Removing records with missing 'Task_completion_time' could potentially bias the analysis, especially since the missing data appears to be non-random and varies significantly across different 'JOB_TYPE_DESCRIPTION' categories.
1- Selective Removal of Data:
By removing records with missing 'Task_completion_time', we might disproportionately remove certain types of jobs. For instance, since 'Tenant Doing Own Repair' and 'Play Equipment Inspections' have a 100% missing rate, these job types would be entirely excluded from the analysis. This could skew the results, as the analysis would then only represent job types where completion time is consistently recorded.
2- Potential Loss of Important Information:
By removing these records, we lose valuable insights from job types that are significant in number or have particular importance from an operational or business perspective. The analysis would not capture the full spectrum of jobs managed, potentially leading to conclusions that are not representative of the overall data.
3- Biased Understanding of the Process:
The pattern of missing data might itself be an important insight. It could indicate issues or peculiarities in the operational process or data collection methods for certain job types. By excluding these records, we might miss the opportunity to identify and address the underlying issues that lead to missing data.
4- Non-Random Missingness:
The variability in missing rates indicates that the missingness is related to the type of job, a pattern that suggests it is not random.
5- Analysis:
This non-random pattern of missing data suggests that simply excluding these records, or imputing them without considering the job type, might introduce bias into the analysis. Understanding why certain job types have higher rates of missing 'Task_completion_time' is important: it could inform how the missing values are handled and might also provide insights into operational aspects that could be optimized.
Data Imputation (Date Completion Field):
############################################################################################
We found earlier, via the chi-square test of independence, that "Date Comp" is either MAR (Missing at Random) or MNAR (Missing Not at Random).
MAR occurs when the propensity for a data point to be missing is related to observed data and not to the missing data itself. MNAR occurs when the propensity for a data point to be missing is related to the missing data itself.
For MCAR data, most imputation methods are valid. For MAR data, methods that model the probability of missingness given observed data can be used. However, imputing MNAR data accurately is often more complex and requires additional information or assumptions. By grouping MAR and MNAR, the analysis highlights cases where standard imputation techniques might be inadequate or require more careful consideration.
##################################################################################################
Contextual Knowledge Required:
Distinguishing MAR from MNAR often requires detailed knowledge about the context and the data collection process, which might not be available purely from the data.
##################################################################################################
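As an illustration of the dependence check referred to above, a chi-square test of independence between a missing-data indicator and an observed grouping variable can be run with SciPy. The counts below are toy values for two hypothetical job-type groups, not the notebook's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: two hypothetical job-type groups; columns: [missing, not missing] counts.
observed = np.array([[90, 10],
                     [50, 50]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.3g}, dof={dof}")
# A small p-value suggests missingness depends on the observed grouping
# (consistent with MAR rather than MCAR); the test cannot rule out MNAR,
# since MNAR depends on the unobserved values themselves.
```

Note the test can only detect dependence on *observed* variables, which is why distinguishing MAR from MNAR still requires contextual knowledge.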
Conclusion (Caveat behind dropping of records):
Without any contextual knowledge of why those 790 "Date Comp" records are missing, our only option at the moment is to drop them, while being fully aware that this could introduce bias into the data if a high proportion of the missing records belongs to a single category, which would then be entirely excluded from the analysis.
# Drop rows where 'Task_completion_time' is NaN
int_df_copy_cleaned = int_df_copy.dropna(subset=['Task_completion_time'])
# Check for missing records in 'Task_completion_time'
missing_records_after_cleaning = int_df_copy_cleaned['Task_completion_time'].isna().sum()
print("Missing records count after dropping of missing Task completion time records:")
print(missing_records_after_cleaning)
Missing records count after dropping of missing Task completion time records: 0
Feature Importance Analysis for Response Variable ("Total Cost")
# Selected columns
selected_columns = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'SOR_DESCRIPTION', 'Mgt Area', 'Task_completion_time',
'Total Value']
# Create the new DataFrame with relevant columns
# data_relevant = int_df_copy[selected_columns]
# data_relevant = data_relevant.dropna(subset=['Task_completion_time'])
# Create the new DataFrame with relevant columns
# Explicitly creating a copy of the slice to avoid SettingWithCopyWarning
data_relevant = int_df_copy_cleaned[selected_columns].copy()
# Encoding categorical variables
label_encoders = {}
for column in data_relevant.select_dtypes(include=['object']).columns:
label_encoder = LabelEncoder()
# Encoding and directly assigning to the DataFrame copy
data_relevant[column] = label_encoder.fit_transform(data_relevant[column].astype(str))
label_encoders[column] = label_encoder
# Imputing missing values
imputer = SimpleImputer(strategy='mean')
# Applying imputer and ensuring the result is stored in the same DataFrame variable
data_relevant = pd.DataFrame(imputer.fit_transform(data_relevant), columns=data_relevant.columns)
# Separating the target variable and features
# X = data_relevant.drop('Task_completion_time', axis=1)
# y = data_relevant['Task_completion_time']
X = data_relevant.drop('Total Value', axis=1)
y = data_relevant['Total Value']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Feature importance
feature_importance = model.feature_importances_
sorted_idx = np.argsort(feature_importance)
feature_names_sorted = X.columns[sorted_idx]
importance_sorted = feature_importance[sorted_idx]
# Plotting
plt.figure(figsize=(6, 4)) # Larger figure size
bars = plt.barh(feature_names_sorted, importance_sorted)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Regressor - Feature Importance for Predicting Total Repair Cost')
# Adding simplified text annotations on top of each bar
for bar in bars:
if bar.get_width() > 0:  # Only annotate bars with non-zero width
plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
f'{bar.get_width():.2f}', # Rounded to two decimal places
va='center', ha='left') # Adjust text alignment if needed
plt.show()
# Create a DataFrame to display the feature names and importances
feature_importance_RF_df = pd.DataFrame({'Feature': feature_names_sorted, 'Importance': importance_sorted})
feature_importance_RF_df = feature_importance_RF_df.sort_values(by='Importance', ascending=False)
# Display the feature names and importances as a DataFrame
print("Random Forest Regressor - Feature Importance:")
print(feature_importance_RF_df)
#######################################################################################################################
# ANOVA F-value feature selection
# (note: f_classif treats the target as categorical; for a continuous
# target like 'Total Value', f_regression may be more appropriate)
f_values, p_values = f_classif(X_train, y_train)
# Sorting indices by feature importance
sorted_indices = np.argsort(f_values)[::-1][:-1]
# Create a DataFrame to display the feature names and ANOVA F-values
feature_importance_anova_df = pd.DataFrame({'Feature': X_train.columns[sorted_indices], 'ANOVA F-value': f_values[sorted_indices]})
# Plotting
plt.figure(figsize=(6, 4))
# ANOVA F-values
plt.barh(range(len(sorted_indices)), f_values[sorted_indices][::-1], color='skyblue') # Reverse the order
plt.yticks(range(len(sorted_indices)), X_train.columns[sorted_indices][::-1]) # Reverse the order
plt.xlabel('ANOVA F-value')
plt.ylabel('Features')
plt.title('ANOVA F-value - Feature Importance')
# Display descending scores below the plot
for i, value in enumerate(f_values[sorted_indices][::-1]): # Reverse the order
plt.text(value, i, f'{value:.2f}', va='center', fontsize=8)
plt.tight_layout()
plt.show()
# Display the feature names and ANOVA F-values as a DataFrame
print("ANOVA F-value - Feature Importance:")
print(feature_importance_anova_df)
#######################################################################################################################
# Convert X_test to dense array for permutation importance
# X_test_dense = X_test.toarray()
# Permutation Importance
#Here 'model' is the trained RandomForestRegressor and 'X_test, y_test' are test data
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=40, random_state=42)
# Permutation Importance
plt.figure(figsize=(6, 4)) # Adjust the figure size as needed
plt.subplot(1, 1, 1) # Modify subplot parameters if needed
sorted_idx_perm = perm_importance.importances_mean.argsort()
plt.barh(range(len(sorted_idx_perm)), perm_importance.importances_mean[sorted_idx_perm])
plt.yticks(range(len(sorted_idx_perm)), X_test.columns[sorted_idx_perm]) # Add y-axis labels
plt.title('Permutation Importance')
# Display descending scores below the plot
for i, value in enumerate(perm_importance.importances_mean[sorted_idx_perm]):
plt.text(value, i, f'{value:.4f}', va='center', fontsize=8)
plt.tight_layout()
plt.show()
# Create a DataFrame to display the feature names and permutation importances
perm_importance_df = pd.DataFrame({'Feature': X_test.columns[sorted_idx_perm], 'Permutation Importance': perm_importance.importances_mean[sorted_idx_perm]})
perm_importance_df = perm_importance_df.sort_values(by='Permutation Importance', ascending=False)
# Display the feature names and permutation importances as a DataFrame
print("Permutation- Feature Importance:")
print(perm_importance_df)
#######################################################################################################################
# X_train and y_train are the training datasets
linear_model = LinearRegression()
rfe = RFE(estimator=linear_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
# Create a DataFrame to store the feature names and rankings
rfe_ranking_df = pd.DataFrame({'Feature': X_train.columns, 'Ranking': rfe.ranking_})
# Sort the DataFrame by RFE rankings
rfe_ranking_df_sorted = rfe_ranking_df.sort_values(by='Ranking')
# Plotting
plt.figure(figsize=(10, 6)) # Adjust the figure size as needed
# Bar plot
plt.subplot(2, 1, 1) # Updated subplot to accommodate two plots
plt.barh(range(len(rfe_ranking_df_sorted)), rfe_ranking_df_sorted['Ranking'])
plt.yticks(range(len(rfe_ranking_df_sorted)), rfe_ranking_df_sorted['Feature']) # Add y-axis labels
plt.xlabel('RFE Feature Ranking (Lower is Better)')
plt.title('RFE Feature Ranking')
# Display the rankings on the plot
for i, value in enumerate(rfe_ranking_df_sorted['Ranking']):
plt.text(value, i, f'{value}', va='center', fontsize=8)
plt.tight_layout() # Adjust layout
plt.show()
# Display the sorted DataFrame below the plot
print("\nRFE Feature Ranking DataFrame (Sorted):")
print(rfe_ranking_df_sorted)
Random Forest Regressor - Feature Importance:
Feature Importance
11 Jobsourcedescription 0.330431
10 Initial Priority Description 0.171780
9 SOR_DESCRIPTION 0.136571
8 Task_completion_time 0.118702
7 JOB_TYPE_DESCRIPTION 0.087115
6 Property Type 0.044954
5 JOB_STATUS_DESCRIPTION 0.035220
4 TRADE_DESCRIPTION 0.031162
3 ABANDON_REASON_DESC 0.022196
2 Latest Priority Description 0.017337
1 Mgt Area 0.003170
0 CONTRACTOR 0.001362
ANOVA F-value - Feature Importance:
Feature ANOVA F-value
0 CONTRACTOR 9.496860
1 JOB_TYPE_DESCRIPTION 6.925849
2 JOB_STATUS_DESCRIPTION 3.604884
3 ABANDON_REASON_DESC 3.003460
4 TRADE_DESCRIPTION 2.450702
5 Task_completion_time 2.116972
6 SOR_DESCRIPTION 2.105041
7 Mgt Area 2.025992
8 Latest Priority Description 1.743259
9 Property Type 1.369187
10 Jobsourcedescription 1.260883
Permutation- Feature Importance:
Feature Permutation Importance
11 Jobsourcedescription 2.166194
10 JOB_TYPE_DESCRIPTION 1.175952
9 Initial Priority Description 0.352254
8 Latest Priority Description 0.119766
7 SOR_DESCRIPTION 0.024656
6 TRADE_DESCRIPTION 0.020291
5 ABANDON_REASON_DESC 0.013891
4 Task_completion_time 0.013369
3 Property Type 0.004914
2 CONTRACTOR 0.002090
1 Mgt Area -0.000055
0 JOB_STATUS_DESCRIPTION -0.041079
RFE Feature Ranking DataFrame (Sorted):
Feature Ranking
0 JOB_TYPE_DESCRIPTION 1
3 Jobsourcedescription 1
5 Latest Priority Description 1
6 JOB_STATUS_DESCRIPTION 1
10 Mgt Area 1
2 Property Type 2
8 ABANDON_REASON_DESC 3
11 Task_completion_time 4
4 Initial Priority Description 5
7 TRADE_DESCRIPTION 6
1 CONTRACTOR 7
9 SOR_DESCRIPTION 8
feature_importance_all
| | Feature | RF | ANOVA | Permutation | RFE | RF_normalized | ANOVA_normalized | Permutation_normalized | RFE_normalized | RF_important | ANOVA_important | Permutation_important | RFE_important | Total_important |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB_TYPE_DESCRIPTION | 4.0 | 2.0 | 2.0 | 3.0 | 0.363636 | 0.2 | 0.181818 | 0.750000 | 0 | 0 | 0 | 1 | 1 |
| 1 | CONTRACTOR | 11.0 | 1.0 | 10.0 | 8.0 | 1.000000 | 0.1 | 0.909091 | 0.333333 | 1 | 0 | 1 | 0 | 2 |
| 2 | Property Type | 7.0 | 9.0 | 9.0 | 9.0 | 0.636364 | 0.9 | 0.818182 | 0.250000 | 1 | 1 | 1 | 0 | 3 |
| 3 | Jobsourcedescription | 1.0 | 10.0 | 1.0 | 3.0 | 0.090909 | 1.0 | 0.090909 | 0.750000 | 0 | 1 | 0 | 1 | 2 |
| 4 | Initial Priority Description | 2.0 | NaN | 3.0 | 6.0 | 0.181818 | 0.0 | 0.272727 | 0.500000 | 0 | 0 | 0 | 1 | 1 |
| 5 | Latest Priority Description | 8.0 | 8.0 | 5.0 | 7.0 | 0.727273 | 0.8 | 0.454545 | 0.416667 | 1 | 1 | 0 | 0 | 2 |
| 6 | JOB_STATUS_DESCRIPTION | 5.0 | 4.0 | 6.0 | 3.0 | 0.454545 | 0.4 | 0.545455 | 0.750000 | 0 | 0 | 1 | 1 | 2 |
| 7 | TRADE_DESCRIPTION | 6.0 | 5.0 | 7.0 | 10.0 | 0.545455 | 0.5 | 0.636364 | 0.166667 | 1 | 0 | 1 | 0 | 2 |
| 8 | ABANDON_REASON_DESC | 9.0 | 3.0 | 8.0 | 3.0 | 0.818182 | 0.3 | 0.727273 | 0.750000 | 1 | 0 | 1 | 1 | 3 |
| 9 | SOR_DESCRIPTION | 3.0 | 6.0 | 4.0 | 11.0 | 0.272727 | 0.6 | 0.363636 | 0.083333 | 0 | 1 | 0 | 0 | 1 |
| 10 | Mgt Area | 10.0 | 7.0 | 11.0 | 3.0 | 0.909091 | 0.7 | 1.000000 | 0.750000 | 1 | 1 | 1 | 1 | 4 |
#############################################################################################
Ensemble Mix of best feature selection (based on majority-voting consensus) between different feature selection techniques
Step 1: Determine the consensus threshold, i.e. how many methods need to agree on a feature being important. For this example, a feature is considered important if at least 3 out of the 4 methods agree.
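The voting step itself can be sketched as follows. This is a minimal illustration with hypothetical per-method boolean flags (the `*_important` columns are assumed to have been derived already, e.g. by thresholding each method's normalized scores); the threshold of 3 matches the example above:

```python
import pandas as pd

# Hypothetical per-method importance flags (1 = method deems the feature important).
votes = pd.DataFrame({
    'Feature': ['Mgt Area', 'Property Type', 'CONTRACTOR'],
    'RF_important':          [1, 1, 1],
    'ANOVA_important':       [1, 1, 0],
    'Permutation_important': [1, 1, 1],
    'RFE_important':         [1, 0, 0],
})

flag_cols = ['RF_important', 'ANOVA_important',
             'Permutation_important', 'RFE_important']

# Majority voting: count how many methods endorse each feature.
votes['Total_important'] = votes[flag_cols].sum(axis=1)

# Keep features endorsed by at least 3 of the 4 selection methods.
consensus = votes[votes['Total_important'] >= 3]
print(consensus['Feature'].tolist())
```

Varying the `>= 3` threshold to 1, 2, or 4 gives the four consensus sets compared later in the notebook.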
############################################################################################################################
# Normalize importance scores
feature_importance_all['RF_normalized'] = feature_importance_all['RF'] / feature_importance_all['RF'].max()
feature_importance_all['ANOVA_normalized'] = feature_importance_all['ANOVA'].fillna(0) / feature_importance_all['ANOVA'].max()
feature_importance_all['Permutation_normalized'] = feature_importance_all['Permutation'] / feature_importance_all['Permutation'].max()
# Invert RFE ranks for normalization
rfe_max = feature_importance_all['RFE'].max() + 1
feature_importance_all['RFE_normalized'] = (rfe_max - feature_importance_all['RFE']) / rfe_max
print(feature_importance_all)
# Plotting stacked bar chart
plt.figure(figsize=(8, 6))
for i, row in feature_importance_all.iterrows():
plt.bar(row['Feature'], height=row['RF_normalized'], color='b', edgecolor='black', label='RF' if i == 0 else "")
plt.bar(row['Feature'], height=row['ANOVA_normalized'], bottom=row['RF_normalized'], color='g', edgecolor='black', label='ANOVA' if i == 0 else "")
plt.bar(row['Feature'], height=row['Permutation_normalized'], bottom=row['RF_normalized'] + row['ANOVA_normalized'], color='r', edgecolor='black', label='Permutation' if i == 0 else "")
plt.bar(row['Feature'], height=row['RFE_normalized'], bottom=row['RF_normalized'] + row['ANOVA_normalized'] + row['Permutation_normalized'], color='y', edgecolor='black', label='RFE' if i == 0 else "")
plt.xlabel('Feature')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Normalized Importance')
plt.title('Feature Importance Comparison (Stacked Bar Chart)', fontweight = "bold", fontsize = 16)
plt.legend()
plt.show()
Feature RF ANOVA Permutation RFE \
0 JOB_TYPE_DESCRIPTION 4.0 2.0 2.0 3.0
1 CONTRACTOR 11.0 1.0 10.0 8.0
2 Property Type 7.0 9.0 9.0 9.0
3 Jobsourcedescription 1.0 10.0 1.0 3.0
4 Initial Priority Description 2.0 NaN 3.0 6.0
5 Latest Priority Description 8.0 8.0 5.0 7.0
6 JOB_STATUS_DESCRIPTION 5.0 4.0 6.0 3.0
7 TRADE_DESCRIPTION 6.0 5.0 7.0 10.0
8 ABANDON_REASON_DESC 9.0 3.0 8.0 3.0
9 SOR_DESCRIPTION 3.0 6.0 4.0 11.0
10 Mgt Area 10.0 7.0 11.0 3.0
RF_normalized ANOVA_normalized Permutation_normalized RFE_normalized \
0 0.363636 0.2 0.181818 0.750000
1 1.000000 0.1 0.909091 0.333333
2 0.636364 0.9 0.818182 0.250000
3 0.090909 1.0 0.090909 0.750000
4 0.181818 0.0 0.272727 0.500000
5 0.727273 0.8 0.454545 0.416667
6 0.454545 0.4 0.545455 0.750000
7 0.545455 0.5 0.636364 0.166667
8 0.818182 0.3 0.727273 0.750000
9 0.272727 0.6 0.363636 0.083333
10 0.909091 0.7 1.000000 0.750000
RF_important ANOVA_important Permutation_important RFE_important \
0 0 0 0 1
1 1 0 1 0
2 1 1 1 0
3 0 1 0 1
4 0 0 0 1
5 1 1 0 0
6 0 0 1 1
7 1 0 1 0
8 1 0 1 1
9 0 1 0 0
10 1 1 1 1
Total_important
0 1
1 2
2 3
3 2
4 1
5 2
6 2
7 2
8 3
9 1
10 4
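The RFE rank inversion used above can be sanity-checked on a toy Series (the ranks here are hypothetical, with 1 being the best feature):

```python
import pandas as pd

# Hypothetical RFE ranks: 1 = best, higher = worse
rfe_ranks = pd.Series([1, 3, 5], index=['A', 'B', 'C'], name='RFE')

# Invert so the best rank receives the highest normalized score
rfe_max = rfe_ranks.max() + 1  # 6
rfe_normalized = (rfe_max - rfe_ranks) / rfe_max
print(rfe_normalized)  # A: (6-1)/6 ~ 0.833, C: (6-5)/6 ~ 0.167
```

The `+ 1` keeps even the worst-ranked feature from collapsing to exactly zero.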
Tracking of feature selection consensus separately by 1, 2, 3 or all 4 methodsΒΆ
# consensus_1FS_method: Features important in at least 1 method
consensus_1FS_method= feature_importance_all[feature_importance_all['Total_important'] >= 1]
# consensus_2FS_methods: Features important in at least 2 methods
consensus_2FS_methods = feature_importance_all[feature_importance_all['Total_important'] >= 2]
# consensus_3FS_methods: Features important in at least 3 methods
consensus_3FS_methods = feature_importance_all[feature_importance_all['Total_important'] >= 3]
# consensus_4FS_methods: Features important in at least 4 methods
consensus_4FS_methods = feature_importance_all[feature_importance_all['Total_important'] >= 4]
print("Consensus by at least 1 Feature Selection Method")
print(consensus_1FS_method)
print("Consensus by at least 2 Feature Selection Methods")
print(consensus_2FS_methods)
print("Consensus by at least 3 Feature Selection Methods")
print(consensus_3FS_methods)
print("Consensus by at least 4 Feature Selection Methods")
print(consensus_4FS_methods)
Consensus by at least 1 Feature Selection Method
Feature RF ANOVA Permutation RFE \
0 JOB_TYPE_DESCRIPTION 4.0 2.0 2.0 3.0
1 CONTRACTOR 11.0 1.0 10.0 8.0
2 Property Type 7.0 9.0 9.0 9.0
3 Jobsourcedescription 1.0 10.0 1.0 3.0
4 Initial Priority Description 2.0 NaN 3.0 6.0
5 Latest Priority Description 8.0 8.0 5.0 7.0
6 JOB_STATUS_DESCRIPTION 5.0 4.0 6.0 3.0
7 TRADE_DESCRIPTION 6.0 5.0 7.0 10.0
8 ABANDON_REASON_DESC 9.0 3.0 8.0 3.0
9 SOR_DESCRIPTION 3.0 6.0 4.0 11.0
10 Mgt Area 10.0 7.0 11.0 3.0
RF_normalized ANOVA_normalized Permutation_normalized RFE_normalized \
0 0.363636 0.2 0.181818 0.750000
1 1.000000 0.1 0.909091 0.333333
2 0.636364 0.9 0.818182 0.250000
3 0.090909 1.0 0.090909 0.750000
4 0.181818 0.0 0.272727 0.500000
5 0.727273 0.8 0.454545 0.416667
6 0.454545 0.4 0.545455 0.750000
7 0.545455 0.5 0.636364 0.166667
8 0.818182 0.3 0.727273 0.750000
9 0.272727 0.6 0.363636 0.083333
10 0.909091 0.7 1.000000 0.750000
RF_important ANOVA_important Permutation_important RFE_important \
0 0 0 0 1
1 1 0 1 0
2 1 1 1 0
3 0 1 0 1
4 0 0 0 1
5 1 1 0 0
6 0 0 1 1
7 1 0 1 0
8 1 0 1 1
9 0 1 0 0
10 1 1 1 1
Total_important
0 1
1 2
2 3
3 2
4 1
5 2
6 2
7 2
8 3
9 1
10 4
Consensus by at least 2 Feature Selection Methods
Feature RF ANOVA Permutation RFE \
1 CONTRACTOR 11.0 1.0 10.0 8.0
2 Property Type 7.0 9.0 9.0 9.0
3 Jobsourcedescription 1.0 10.0 1.0 3.0
5 Latest Priority Description 8.0 8.0 5.0 7.0
6 JOB_STATUS_DESCRIPTION 5.0 4.0 6.0 3.0
7 TRADE_DESCRIPTION 6.0 5.0 7.0 10.0
8 ABANDON_REASON_DESC 9.0 3.0 8.0 3.0
10 Mgt Area 10.0 7.0 11.0 3.0
RF_normalized ANOVA_normalized Permutation_normalized RFE_normalized \
1 1.000000 0.1 0.909091 0.333333
2 0.636364 0.9 0.818182 0.250000
3 0.090909 1.0 0.090909 0.750000
5 0.727273 0.8 0.454545 0.416667
6 0.454545 0.4 0.545455 0.750000
7 0.545455 0.5 0.636364 0.166667
8 0.818182 0.3 0.727273 0.750000
10 0.909091 0.7 1.000000 0.750000
RF_important ANOVA_important Permutation_important RFE_important \
1 1 0 1 0
2 1 1 1 0
3 0 1 0 1
5 1 1 0 0
6 0 0 1 1
7 1 0 1 0
8 1 0 1 1
10 1 1 1 1
Total_important
1 2
2 3
3 2
5 2
6 2
7 2
8 3
10 4
Consensus by at least 3 Feature Selection Methods
Feature RF ANOVA Permutation RFE RF_normalized \
2 Property Type 7.0 9.0 9.0 9.0 0.636364
8 ABANDON_REASON_DESC 9.0 3.0 8.0 3.0 0.818182
10 Mgt Area 10.0 7.0 11.0 3.0 0.909091
ANOVA_normalized Permutation_normalized RFE_normalized RF_important \
2 0.9 0.818182 0.25 1
8 0.3 0.727273 0.75 1
10 0.7 1.000000 0.75 1
ANOVA_important Permutation_important RFE_important Total_important
2 1 1 0 3
8 0 1 1 3
10 1 1 1 4
Consensus by at least 4 Feature Selection Methods
Feature RF ANOVA Permutation RFE RF_normalized ANOVA_normalized \
10 Mgt Area 10.0 7.0 11.0 3.0 0.909091 0.7
Permutation_normalized RFE_normalized RF_important ANOVA_important \
10 1.0 0.75 1 1
Permutation_important RFE_important Total_important
10 1 1 4
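The four near-identical filters above could also be expressed as a single parameterized helper; a small sketch on toy data:

```python
import pandas as pd

# Toy miniature of feature_importance_all
toy = pd.DataFrame({'Feature': ['A', 'B', 'C'],
                    'Total_important': [1, 3, 4]})

def consensus(df, threshold):
    """Features flagged important by at least `threshold` selection methods."""
    return df[df['Total_important'] >= threshold]

print(consensus(toy, 3)['Feature'].tolist())  # ['B', 'C']
```

This is the same filter the interactive consensus-threshold app applies inside its callback.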
User custom selection of either an individual feature selection technique or a consensus threshold valueΒΆ
Note - This is a Dash interactive app. Please note that for interaction it needs to be hosted as a standalone web application on a service provider's platform.
# Normalize importance scores and invert RFE ranks
feature_importance_all['RF_normalized'] = feature_importance_all['RF'] / feature_importance_all['RF'].max()
feature_importance_all['ANOVA_normalized'] = feature_importance_all['ANOVA'].fillna(0) / feature_importance_all['ANOVA'].max()
feature_importance_all['Permutation_normalized'] = feature_importance_all['Permutation'] / feature_importance_all['Permutation'].max()
rfe_max = feature_importance_all['RFE'].max() + 1
feature_importance_all['RFE_normalized'] = (rfe_max - feature_importance_all['RFE']) / rfe_max
app = dash.Dash(__name__)
app.layout = html.Div([
html.H4("Feature Selection Technique:", style={'textAlign': 'center'}),
dcc.Dropdown(
id='feature-selection-dropdown',
options=[
{'label': 'Random Forest Regressor', 'value': 'RF_normalized'},
{'label': 'ANOVA-F Value', 'value': 'ANOVA_normalized'},
{'label': 'Permutation Importance', 'value': 'Permutation_normalized'},
{'label': 'RFE (Recursive Feature Elimination)', 'value': 'RFE_normalized'}
],
value='RF_normalized'
),
dcc.Graph(id='feature-importance-graph')
], style={'textAlign': 'center'})
@app.callback(
Output('feature-importance-graph', 'figure'),
[Input('feature-selection-dropdown', 'value')]
)
def update_graph(selected_technique):
    # Sort the DataFrame based on the selected technique's score in descending order
    sorted_df = feature_importance_all.sort_values(by=selected_technique, ascending=False)
    # Create the bar plot with the sorted DataFrame
    fig = px.bar(sorted_df, x='Feature', y=selected_technique, color='Feature')
    fig.update_layout(
        title='<b>Feature Importance Comparison</b>', title_x=0.5,
        xaxis_title='<b>Feature</b>',
        yaxis_title='<b>Normalized Importance</b>',
        legend_title='<b>Feature Selection Technique</b>'
    )
    return fig
if __name__ == '__main__':
    app.run_server(debug=True, port=8054)
################################################################################################################
# Mark features as important (1) or not important (0) in each method
feature_importance_all['RF_important'] = (feature_importance_all['RF'] >= feature_importance_all['RF'].median()).astype(int)
feature_importance_all['ANOVA_important'] = (feature_importance_all['ANOVA'] >= feature_importance_all['ANOVA'].median()).astype(int)
feature_importance_all['Permutation_important'] = (feature_importance_all['Permutation'] >= feature_importance_all['Permutation'].median()).astype(int)
feature_importance_all['RFE_important'] = (feature_importance_all['RFE'] <= feature_importance_all['RFE'].median()).astype(int) # Lower rank is better
# Calculate the total number of methods that find each feature important
feature_importance_all['Total_important'] = (feature_importance_all[['RF_important', 'ANOVA_important', 'Permutation_important', 'RFE_important']].sum(axis=1))
print(feature_importance_all[['Feature', 'Total_important']])
# Select features that meet the consensus threshold (default value shown before any dropdown interaction)
consensus_threshold = 2
consensus_features = feature_importance_all[feature_importance_all['Total_important'] >= consensus_threshold]
# Prepare the data for plotting
consensus_feature_importance = consensus_features[['RF', 'ANOVA', 'Permutation', 'RFE']]
consensus_feature_importance = consensus_feature_importance.rename(columns={
'RF': 'Random Forest Regressor',
'ANOVA': 'Anova F-Value',
'Permutation': 'Permutation Feature Importance',
'RFE': 'Recursive Feature Elimination'
})
app = dash.Dash(__name__)
app.layout = html.Div([
html.H4("Feature Selection Consensus Threshold:", style={'textAlign': 'center'}),
dcc.Dropdown(
id='consensus-threshold-dropdown',
options=[{'label': i, 'value': i} for i in range(1, 5)],
value=2
),
dcc.Graph(id='feature-importance-graph')
], style={'textAlign': 'center'})
@app.callback(
Output('feature-importance-graph', 'figure'),
[Input('consensus-threshold-dropdown', 'value')]
)
def update_graph(consensus_threshold):
    consensus_features = feature_importance_all[feature_importance_all['Total_important'] >= consensus_threshold]
    consensus_feature_importance = consensus_features[['RF', 'ANOVA', 'Permutation', 'RFE']]
    consensus_feature_importance = consensus_feature_importance.rename(columns={
        'RF': 'Random Forest Regressor',
        'ANOVA': 'Anova F-Value',
        'Permutation': 'Permutation Feature Importance',
        'RFE': 'Recursive Feature Elimination'
    })
    consensus_feature_importance.index = consensus_features['Feature']
    fig = go.Figure()
    for col in consensus_feature_importance.columns:
        fig.add_trace(go.Bar(x=consensus_feature_importance.index,
                             y=consensus_feature_importance[col],
                             name=col))
    fig.update_layout(barmode='stack',
                      title='<b>Consensus Among Feature Selection Methods - Feature Importance</b>', title_x=0.5,
                      xaxis_title='<b>Feature</b>',
                      yaxis_title='<b>Importance Score</b>',
                      legend_title='<b>Feature Importance Selection Method</b>')
    return fig
if __name__ == '__main__':
    app.run_server(debug=True, port=8055)
                         Feature  Total_important
0           JOB_TYPE_DESCRIPTION                1
1                     CONTRACTOR                2
2                  Property Type                3
3           Jobsourcedescription                2
4   Initial Priority Description                1
5    Latest Priority Description                2
6         JOB_STATUS_DESCRIPTION                2
7              TRADE_DESCRIPTION                2
8            ABANDON_REASON_DESC                3
9                SOR_DESCRIPTION                1
10                      Mgt Area                4
Checking for data distribution ("Task Completion Time" and "Total Repair Cost") and relationship between them i.e. linear or non-linearΒΆ
# Plot histograms and box plots
df_cleaned = int_df_copy.dropna(subset=['Task_completion_time'])
plt.figure(figsize=(18, 8))
plt.subplot(2, 3, 1)
sns.histplot(df_cleaned['Task_completion_time'], bins=30, kde=True)
plt.title('Histogram - Task_completion_time')
plt.subplot(2, 3, 4)
sns.boxplot(y=df_cleaned['Task_completion_time'])
plt.title('Box Plot - Task_completion_time')
plt.subplot(2, 3, 2)
sns.histplot(df_cleaned['Total Value'], bins=30, kde=True)
plt.title('Histogram - Total Value')
plt.subplot(2, 3, 5)
sns.boxplot(y=df_cleaned['Total Value'])
plt.title('Box Plot - Total Value')
# Plot Q-Q plots
plt.subplot(2, 3, 3)
stats.probplot(df_cleaned['Task_completion_time'], plot=plt)
plt.title('Q-Q Plot - Task_completion_time')
plt.subplot(2, 3, 6)
stats.probplot(df_cleaned['Total Value'], plot=plt)
plt.title('Q-Q Plot - Total Value')
plt.tight_layout()
plt.show()
# Shapiro-Wilk test for normality
stat_tc, p_value_tc = stats.shapiro(df_cleaned['Task_completion_time'])
stat_tv, p_value_tv = stats.shapiro(df_cleaned['Total Value'])
print(f'Shapiro-Wilk test for normality:')
print(f'Task_completion_time - Statistic: {stat_tc}, p-value: {p_value_tc}')
print(f'Total Value - Statistic: {stat_tv}, p-value: {p_value_tv}')
Shapiro-Wilk test for normality:
Task_completion_time - Statistic: 0.5657979846000671, p-value: 0.0
Total Value - Statistic: 0.20409810543060303, p-value: 0.0
C:\Users\dmish\anaconda3\lib\site-packages\scipy\stats\_morestats.py:1800: UserWarning: p-value may not be accurate for N > 5000.
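Given the UserWarning above (Shapiro-Wilk p-values are inaccurate for N > 5000), the D'Agostino-Pearson test is one alternative for large samples; a sketch on synthetic right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed sample, loosely mimicking the repair-data shape
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# D'Agostino-Pearson combines skewness and kurtosis tests; no N > 5000 caveat
stat, p_value = stats.normaltest(sample)
print(f"Statistic: {stat:.2f}, p-value: {p_value:.3g}")
```

As with Shapiro-Wilk, a small p-value rejects the hypothesis of normality.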
Q-Q Plot (quantile-quantile) Analysis:ΒΆ
- In the provided Q-Q plots, it is evident that for both 'Task_completion_time' and 'Total Value', the data points deviate significantly from the red line, especially towards the ends of the distribution.
- This deviation suggests that both variables are not normally distributed and are right-skewed with a long tail of higher values.
Histogram Analysis:ΒΆ
- The histograms for both 'Task_completion_time' and 'Total Value' show a concentration of values near the lower end, with a long tail to the right. This pattern is characteristic of a positively skewed distribution.
- The peak is not in the center and is instead shifted towards the left of the histogram, indicating that most of the data are concentrated in the lower range of values.
Box Plot AnalysisΒΆ
- Task_completion_time: The box plot shows many points above the upper whisker, which are potential outliers. This reinforces the right-skewness observed in the histogram.
- The median is closer to the bottom of the box, which further indicates a right-skewed distribution.
- Total Value: The box plot reveals a large number of potential outliers above the upper whisker, suggesting extreme values that are far from the median.
- Like the 'Task_completion_time', the median of 'Total Value' is towards the lower end of the box, indicating right-skewness.
Analysis Conclusion:ΒΆ
- In summary, both variables exhibit right-skewed distributions and do not adhere to the assumption of normality.
- Both 'Task_completion_time' and 'Total Value' display right-skewed distributions. The presence of outliers and the long tail to the right suggest that there are instances of unusually high values which could be due to complex or extensive repairs that take longer and cost more. This skewness implies that median values would be a better measure of central tendency than the mean, as the mean can be heavily influenced by outliers in a skewed distribution.
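The claim that the median is the safer central-tendency measure under right-skew can be checked on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy right-skewed "repair costs"; the long tail pulls the mean upward
costs = rng.exponential(scale=100.0, size=10_000)

print(np.mean(costs) > np.median(costs))  # True for right-skewed data
```

For an exponential distribution the mean sits well above the median, exactly the pattern seen in the box plots above.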
Relationship between Predictor (Task_completion_time) and response variable (Total_Value)ΒΆ
df_cleaned = int_df_copy.dropna(subset=['Task_completion_time'])
# Scatter plot
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x='Task_completion_time', y='Total Value', data=df_cleaned, alpha=0.5)
plt.xlabel('Task Completion Time')
plt.ylabel('Total Value')
plt.title('Scatter Plot - Task_completion_time vs Total Value')
# Spearman correlation for the two variables
spearman_corr, spearman_p_value = spearmanr(df_cleaned['Task_completion_time'], df_cleaned['Total Value'])
# Correlation plot for the two variables
plt.subplot(1, 2, 2)
sns.heatmap(df_cleaned[['Task_completion_time', 'Total Value']].corr(method='spearman'), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Spearman Correlation Heatmap - Task_completion_time vs Total Value')
plt.tight_layout()
plt.show()
# Print out the Spearman correlation coefficient and p-value
print(f"Spearman correlation coefficient: {spearman_corr}")
print(f"P-value: {spearman_p_value}")
# Interpretation
if spearman_p_value < 0.05:  # Assuming a significance level of 0.05
    if abs(spearman_corr) > 0.7:
        print("There is a strong monotonic relationship.")
    elif abs(spearman_corr) > 0.3:
        print("There is a moderate monotonic relationship.")
    else:
        print("There is a weak monotonic relationship.")
else:
    print("The relationship is not statistically significant.")
Spearman correlation coefficient: 0.008254867011982858
P-value: 0.2373049805896757
The relationship is not statistically significant.
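The significant Pearson r quoted in the next section and the non-significant Spearman rho here are not contradictory: the two coefficients capture different aspects of association. A toy illustration of how they can diverge:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 200)
y = np.exp(x)  # strictly increasing but strongly non-linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
# Spearman is 1 for any strictly increasing relationship; Pearson is not
print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}")
```

Pearson measures only linear association, while Spearman measures monotonic association of the ranks, so they can disagree in either direction.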
Interpreting the above metrics to understand the relationship between "Total Repair Value" and "Task Completion Time"ΒΆ
Choosing the Model selection based on this data relationship:ΒΆ
Pearson correlation coefficient (r): 0.10261301025652392
P-value: 4.22938032972636e-49
- As the p-value is extremely small (<= 0.05), this suggests that there is a weak but statistically significant linear relationship between 'Task Completion Time' and 'Total Value' (repair costs) in the data.
Interpretation:ΒΆ
Statistical Significance: The very low p-value confirms that the relationship between 'Task Completion Time' and 'Total Value' is not due to random chance. This means that 'Task Completion Time' does have some degree of linear association with repair costs.
Weak Linear Relationship: The correlation coefficient of 0.103, though statistically significant, is quite low. This indicates that while there is a linear relationship, it is weak. This means that 'Task Completion Time' alone may not be a strong predictor of repair costs.
With the understanding that "correlation is not causation", from a business perspective even a weak relationship suggests that 'Task Completion Time' should not be entirely ignored, as it is highly statistically significant.
It may be one of several factors influencing repair costs, and its impact might be more pronounced when analyzed in conjunction with other variables.
Model Choice justification:ΒΆ
We have a weak but statistically significant positive linear relationship between the predictor (Task_completion_time) and the response (Total Value), as the Pearson correlation coefficient indicates (the Spearman rank correlation was not significant).
Given a newly feature-engineered numeric variable (Task_completion_time) alongside many high-cardinality categorical predictors and a few outliers, Gradient Boosting's iterative optimization can be leveraged to minimize the model's loss.
Scatter Plot Interpretation:ΒΆ
- Most of the data points are clustered near the origin, suggesting that many repairs are of lower cost and completed in less time.
- There are a few points spread out, indicating some tasks that take longer and cost more, but these are exceptions rather than the norm.
- There is no obvious upward or downward trend in the scatter plot, suggesting a lack of strong linear or monotonic relationship.
Spearman Correlation Heatmap Interpretation:ΒΆ
The Spearman correlation coefficient is very close to zero (0.0083), indicating almost no monotonic relationship between 'Task_completion_time' and 'Total Value'.
The p-value is greater than the typical alpha level of 0.05, which means the correlation observed is not statistically significant, and there's a high chance that any correlation is due to random variation in the data.
Deciding on modelling based on the above analysis:ΒΆ
The lack of a significant correlation (the non-significant Spearman correlation) means that 'Task_completion_time' is not strongly predictive of 'Total Value' in a linear or monotonic sense.
Since there is no simple linear or monotonic relationship with repair costs, 'Task_completion_time' is best used as part of a broader set of features in an ensemble modelling approach, helping capture more aspects of the data's variance.
Gradient Boosting (XGB) - Model-1: (11 predictors) - (with 'Task Completion Time')ΒΆ
# Define predictor variables and response variable
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription',
'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'Mgt Area', 'Task_completion_time']
response = 'Total Value'
# One-hot encode categorical variables
# Note: Task_completion_time is numeric but is included in categorical_features,
# so each distinct completion-time value becomes its own one-hot column
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
('transformer', transformer),
('model', XGBRegressor(random_state=42))
])
# Parameter distributions for Grid Search
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [9],
'model__learning_rate': [0.01],
'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and RΒ² for each set
def calculate_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, RΒ²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, RΒ²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, RΒ²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
    train_errors, val_errors = [], []
    m_values = range(1, min(len(X_train), max_data_points) + 1, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
    plt.figure(figsize=(10, 6))
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.title("Learning Curves", fontweight='bold')
    plt.legend()
    plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 2.7s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.9s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.7s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.5s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.5s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, RΒ²): (47324.95511787174, 217.54299602118138, 84.3050034789942, 0.8625266274392533)
Testing Metrics (MSE, RMSE, MAE, RΒ²): (90568.19396359427, 300.94549998894195, 93.42650755444166, 0.5167482192894303)
Validation Metrics (MSE, RMSE, MAE, RΒ²): (75996.17538891257, 275.6740382932578, 92.27464085573645, 0.7201746878417186)
Random Forest Modelling (predictors with Completion Time) - 9 PredictorsΒΆ
This is for contrastive comparative analysis with the Gradient Boosting technique on the same set of predictors in combination with "Task Completion Time"ΒΆ
Please note, this contrastive comparison is done against the XGB Model No-3 below (cell No. 143)ΒΆ
# Define predictor variables and response variable
predictors = ['Property Type', 'Jobsourcedescription',
'ABANDON_REASON_DESC', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'Latest Priority Description', 'Mgt Area', 'CONTRACTOR', 'Task_completion_time']
response = 'Total Value'
# One-hot encode categorical variables
# Note: Task_completion_time is numeric but is included in categorical_features,
# so each distinct completion-time value becomes its own one-hot column
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Transform the datasets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_val_transformed = transformer.transform(X_val)
# Create a Pipeline with Random Forest model
pipeline = Pipeline([
('transformer', transformer),
('model', RandomForestRegressor(random_state=42))
])
# Parameter grid for Grid Search
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [20],
'model__min_samples_leaf': [1],
'model__max_features': ['auto']
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold,
                           scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best model and parameters
best_model = grid_search.best_estimator_.named_steps['model']
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_model.predict(X_train_transformed)
y_test_preds = best_model.predict(X_test_transformed)
y_val_preds = best_model.predict(X_val_transformed)
# Calculate MSE for each set
train_mse = mean_squared_error(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
val_mse = mean_squared_error(y_val, y_val_preds)
# Print MSE results
print(f"Training MSE: {train_mse}")
print(f"Validation MSE: {val_mse}")
print(f"Testing MSE: {test_mse}")
# Additional evaluation metrics
# Calculate RMSE for the training/testing/validation set
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_preds))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_preds))
# Print the RMSE values
print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")
print(f"Validation RMSE: {val_rmse}")
# Calculate MAE for training/testing/validation data
train_mae = mean_absolute_error(y_train, y_train_preds)
val_mae = mean_absolute_error(y_val, y_val_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
# Calculate RΒ² for training/testing/validation data
train_r2 = r2_score(y_train, y_train_preds)
val_r2 = r2_score(y_val, y_val_preds)
test_r2 = r2_score(y_test, y_test_preds)
# Print MAE results
print(f"Training MAE: {train_mae}")
print(f"Validation MAE: {val_mae}")
print(f"Testing MAE: {test_mae}")
# Print R2 results
print(f"Training RΒ²: {train_r2}")
print(f"Validation R^2: {val_r2}")
print(f"Testing RΒ²: {test_r2}")
# Print MAPE results (MAPE was not computed in this run)
# print(f"Training MAPE: {train_mape}")
# print(f"Validation MAPE: {val_mape}")
# print(f"Testing MAPE: {test_mape}")
# Residuals vs Predicted Plot
residuals = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot', fontweight="bold", fontsize = 16)
plt.show()
# Feature Importance Plot
# Fitting the OneHotEncoder separately to get the feature names
one_hot.fit(X_train[categorical_features])
feature_names = one_hot.get_feature_names_out(input_features=categorical_features)
feature_importances = best_model.feature_importances_
sorted_idx = feature_importances.argsort()
# Print sorted feature importances
print("Sorted Feature Importances:")
for idx in sorted_idx:
    print(f"{feature_names[idx]}: {feature_importances[idx]}")
plt.figure(figsize=(10, 8))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance', fontweight="bold", fontsize = 16)
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, X_test, y_test, step=50, max_data_points=1000):
    train_errors, val_errors, test_errors = [], [], []
    # Use shape[0] to get the number of samples in the training set
    n_train_samples = min(max_data_points, X_train.shape[0])
    m_values = range(1, n_train_samples, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        y_test_predict = model.predict(X_test)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        test_mse = mean_squared_error(y_test, y_test_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
        test_errors.append(test_mse)
    # Plot RMSE against the actual training-set sizes used
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.plot(m_values, np.sqrt(test_errors), label="Test")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.legend()
    plt.title("Training, Validation, and Test Loss Curves", fontweight="bold", fontsize=16)
    plt.show()
# Using the best_model that has been fitted to the entire training dataset
plot_learning_curves(best_model, X_train_transformed, y_train, X_val_transformed, y_val, X_test_transformed, y_test, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.5s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.1s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.5s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 14.2s
[CV] END model__max_depth=20, model__max_features=auto, model__min_samples_leaf=1, model__n_estimators=300; total time= 13.5s
Best Parameters: {'model__max_depth': 20, 'model__max_features': 'auto', 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Training MSE: 47750.582314661144
Validation MSE: 97288.14752414273
Testing MSE: 130058.44016876609
Training RMSE: 218.5190662497466
Testing RMSE: 360.6361603732578
Validation RMSE: 311.91047998447044
Training MAE: 75.66955634244547
Validation MAE: 92.80726791245644
Testing MAE: 96.37979346153512
Training RΒ²: 0.8612902309957601
Validation R^2: 0.6417755747453582
Testing RΒ²: 0.30603703069027155
Sorted Feature Importances (full one-hot listing condensed; ascending in the raw output):
- Several dozen one-hot columns have an importance of exactly 0.0 — mostly rare Task_completion_time levels, plus a number of CONTRACTOR, ABANDON_REASON_DESC and Latest Priority Description levels — and a further block falls below 1e-06.
- Top 10 columns by importance (descending):
  Jobsourcedescription_Total Mobile App: 0.300793
  Jobsourcedescription_OneMobile app: 0.089000
  Latest Priority Description_Two Week Void: 0.042888
  Latest Priority Description_Damp and Mould Follow-On Work: 0.036983
  ABANDON_REASON_DESC_nan: 0.031227
  JOB_STATUS_DESCRIPTION_Abandoned: 0.027584
  Task_completion_time_129.0: 0.026708
  Task_completion_time_44.0: 0.026156
  TRADE_DESCRIPTION_Void Repairs: 0.020744
  Latest Priority Description_Section 11 Works: 0.020605
Gradient Boosting (XGB) - Model-2: 10 predictors (without 'Task_completion_time')¶
Predictors:
['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription', 'Initial Priority Description', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'Mgt Area']
response = 'Total Value'
# Define predictor variables and response variable
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription', 'Initial Priority Description',
'Latest Priority Description', 'JOB_STATUS_DESCRIPTION',
'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
('transformer', transformer),
('model', XGBRegressor(random_state=42))
])
# Parameter grid for Grid Search (a single candidate per hyperparameter)
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [9],
'model__learning_rate': [0.01],
'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and RΒ² for each set
def calculate_metrics(y_true, y_pred):
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, RΒ²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, RΒ²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, RΒ²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
train_errors, val_errors = [], []
m_values = range(1, min(len(X_train), max_data_points) + 1, step)
for m in m_values:
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
plt.figure(figsize=(10, 6))
plt.plot(m_values, np.sqrt(train_errors), label="Train")
plt.plot(m_values, np.sqrt(val_errors), label="Validation")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.title("Learning Curves", fontweight='bold')
plt.legend()
plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.0s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.0s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.0s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.1s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, RΒ²): (70036.87077302866, 264.6448011449094, 90.131791658153, 0.7965512105655763)
Testing Metrics (MSE, RMSE, MAE, RΒ²): (87643.633111057, 296.0466738726463, 91.93610740353245, 0.5323530268707005)
Validation Metrics (MSE, RMSE, MAE, RΒ²): (78331.57954012479, 279.8777939389347, 92.63513771001924, 0.7115755025237158)
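A side note on the cell above: the grid contains exactly one value per hyperparameter, so GridSearchCV reduces to 5-fold cross-validation of a single fixed configuration (hence "Fitting 5 folds for each of 1 candidates"). If a real search were wanted, a hypothetical wider grid might look like this:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical wider grid: 2 x 2 x 2 x 2 = 16 candidates instead of 1.
param_distributions = {
    'model__n_estimators': [300, 600],
    'model__max_depth': [6, 9],
    'model__learning_rate': [0.01, 0.1],
    'model__subsample': [0.8, 0.9],
}
n_candidates = len(list(ParameterGrid(param_distributions)))
print(n_candidates)  # 16 configurations, i.e. 80 fits under 5-fold CV
```

The values above are illustrative, not tuned; with 5-fold CV the total fit count scales as 5 times the number of grid candidates.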
print(consensus_2FS_methods)
Feature                        RF  ANOVA  Permutation  RFE   RF_norm  ANOVA_norm  Perm_norm  RFE_norm
CONTRACTOR                     11      1           10    8  1.000000         0.1   0.909091  0.333333
Property Type                   7      9            9    9  0.636364         0.9   0.818182  0.250000
Jobsourcedescription            1     10            1    3  0.090909         1.0   0.090909  0.750000
Latest Priority Description     8      8            5    7  0.727273         0.8   0.454545  0.416667
JOB_STATUS_DESCRIPTION          5      4            6    3  0.454545         0.4   0.545455  0.750000
TRADE_DESCRIPTION               6      5            7   10  0.545455         0.5   0.636364  0.166667
ABANDON_REASON_DESC             9      3            8    3  0.818182         0.3   0.727273  0.750000
Mgt Area                       10      7           11    3  0.909091         0.7   1.000000  0.750000

Feature                        RF_important  ANOVA_important  Permutation_important  RFE_important  Total_important
CONTRACTOR                                1                0                      1              0                2
Property Type                             1                1                      1              0                3
Jobsourcedescription                      0                1                      0              1                2
Latest Priority Description               1                1                      0              0                2
JOB_STATUS_DESCRIPTION                    0                0                      1              1                2
TRADE_DESCRIPTION                         1                0                      1              0                2
ABANDON_REASON_DESC                       1                0                      1              1                3
Mgt Area                                  1                1                      1              1                4
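The Total_important column above is simply a row-wise vote over the four binary importance flags; the consensus step can be sketched as follows (flag values transcribed from the printout above; a threshold of 2 votes matches the predictor sets used for the consensus models below):

```python
import pandas as pd

# Per-method binary "important" flags, transcribed from the printed table.
flags = pd.DataFrame({
    "Feature": ["CONTRACTOR", "Property Type", "Jobsourcedescription",
                "Latest Priority Description", "JOB_STATUS_DESCRIPTION",
                "TRADE_DESCRIPTION", "ABANDON_REASON_DESC", "Mgt Area"],
    "RF_important":          [1, 1, 0, 1, 0, 1, 1, 1],
    "ANOVA_important":       [0, 1, 1, 1, 0, 0, 0, 1],
    "Permutation_important": [1, 1, 0, 0, 1, 1, 1, 1],
    "RFE_important":         [0, 0, 1, 0, 1, 0, 1, 1],
})

# Sum the four votes, then keep features endorsed by at least `threshold` methods.
flags["Total_important"] = flags.filter(like="_important").sum(axis=1)
threshold = 2
consensus = flags.loc[flags["Total_important"] >= threshold, "Feature"].tolist()
print(consensus)
```

With threshold=2 all eight features survive the vote, which is exactly the predictor list used for Model-3 and Model-4; raising the threshold to 3 leaves only Property Type, ABANDON_REASON_DESC and Mgt Area.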
Model-3: Gradient Boosting (XGB)¶
Consensus predictor set (features voted important by at least 2 of the 4 feature-selection techniques) + engineered feature 'Task_completion_time'¶
Predictors:
['CONTRACTOR','Property Type','Jobsourcedescription', 'Latest Priority Description','JOB_STATUS_DESCRIPTION','TRADE_DESCRIPTION','ABANDON_REASON_DESC','Mgt Area','Task_completion_time']
response = 'Total Value'
# Define predictor variables and response variable
predictors = ['CONTRACTOR','Property Type','Jobsourcedescription',
'Latest Priority Description','JOB_STATUS_DESCRIPTION','TRADE_DESCRIPTION','ABANDON_REASON_DESC','Mgt Area','Task_completion_time']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
('transformer', transformer),
('model', XGBRegressor(random_state=42))
])
# Parameter grid for Grid Search (a single candidate per hyperparameter)
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [9],
'model__learning_rate': [0.01],
'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and RΒ² for each set
def calculate_metrics(y_true, y_pred):
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, RΒ²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, RΒ²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, RΒ²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
train_errors, val_errors = [], []
m_values = range(1, min(len(X_train), max_data_points) + 1, step)
for m in m_values:
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
plt.figure(figsize=(10, 6))
plt.plot(m_values, np.sqrt(train_errors), label="Train")
plt.plot(m_values, np.sqrt(val_errors), label="Validation")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.title("Learning Curves", fontweight='bold')
plt.legend()
plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, RΒ²): (60399.76691824265, 245.76364035032248, 88.24077103344999, 0.8245458524059295)
Testing Metrics (MSE, RMSE, MAE, RΒ²): (115673.91906475218, 340.10868713508654, 98.73325144192887, 0.38278964255065295)
Validation Metrics (MSE, RMSE, MAE, RΒ²): (85188.95028862239, 291.8714619290868, 92.75539826260389, 0.6863259962102262)
Model-4: Gradient Boosting (XGB)¶
Consensus predictor set (features voted important by at least 2 of the 4 feature-selection techniques), without the engineered feature 'Task_completion_time'¶
Predictors:
['CONTRACTOR','Property Type','Jobsourcedescription', 'Latest Priority Description','JOB_STATUS_DESCRIPTION','TRADE_DESCRIPTION','ABANDON_REASON_DESC','Mgt Area']
response = 'Total Value'
# Define predictor variables and response variable
predictors = ['CONTRACTOR','Property Type','Jobsourcedescription',
'Latest Priority Description','JOB_STATUS_DESCRIPTION','TRADE_DESCRIPTION','ABANDON_REASON_DESC','Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
('transformer', transformer),
('model', XGBRegressor(random_state=42))
])
# Parameter grid for Grid Search (a single candidate per hyperparameter)
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [9],
'model__learning_rate': [0.01],
'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and RΒ² for each set
def calculate_metrics(y_true, y_pred):
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, RΒ²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, RΒ²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, RΒ²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
train_errors, val_errors = [], []
m_values = range(1, min(len(X_train), max_data_points) + 1, step)
for m in m_values:
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
train_mse = mean_squared_error(y_train[:m], y_train_predict)
val_mse = mean_squared_error(y_val, y_val_predict)
train_errors.append(train_mse)
val_errors.append(val_mse)
plt.figure(figsize=(10, 6))
plt.plot(m_values, np.sqrt(train_errors), label="Train")
plt.plot(m_values, np.sqrt(val_errors), label="Validation")
plt.xlabel("Training set size")
plt.ylabel("RMSE (Root Mean Squared Error)")
plt.title("Learning Curves", fontweight='bold')
plt.legend()
plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.2s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, RΒ²): (102770.32951540327, 320.5781176490424, 94.77952698672684, 0.7014644015515108)
Testing Metrics (MSE, RMSE, MAE, RΒ²): (104532.95494427589, 323.3155655768461, 96.2668737817493, 0.4422353542782921)
Validation Metrics (MSE, RMSE, MAE, RΒ²): (97124.84710902307, 311.64859555118016, 94.83305754008255, 0.6423768627628483)
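Pulling the test-set R² values printed above into one frame makes the comparison across the four fits explicit (values rounded from the outputs above; the descriptions paraphrase the section headings):

```python
import pandas as pd

# Test-set R2 values copied (rounded) from the notebook outputs above.
scores = pd.DataFrame({
    "model": ["Model-1: all predictors + Task_completion_time",
              "Model-2: full predictor set, no Task_completion_time",
              "Model-3: consensus (>=2 votes) + Task_completion_time",
              "Model-4: consensus (>=2 votes), no Task_completion_time"],
    "test_r2": [0.3060, 0.5324, 0.3828, 0.4422],
})
best = scores.loc[scores["test_r2"].idxmax(), "model"]
print(scores.sort_values("test_r2", ascending=False).to_string(index=False))
```

Within the consensus pair, removing Task_completion_time (Model-4 vs Model-3) shrinks the train/test R² gap from roughly 0.44 to 0.26, suggesting the many completion-time dummy columns overfit the training split.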
print(consensus_3FS_methods)
Feature               RF  ANOVA  Permutation  RFE   RF_norm  ANOVA_norm  Perm_norm  RFE_norm
Property Type          7      9            9    9  0.636364         0.9   0.818182      0.25
ABANDON_REASON_DESC    9      3            8    3  0.818182         0.3   0.727273      0.75
Mgt Area              10      7           11    3  0.909091         0.7   1.000000      0.75

Feature               RF_important  ANOVA_important  Permutation_important  RFE_important  Total_important
Property Type                    1                1                      1              0                3
ABANDON_REASON_DESC              1                0                      1              1                3
Mgt Area                         1                1                      1              1                4
Model-5: Gradient Boosting (XGB)¶
Consensus predictor set (features voted important by at least 3 of the 4 feature-selection techniques) + engineered feature 'Task_completion_time'¶
# Define predictor variables and response variable
predictors = ['Property Type','ABANDON_REASON_DESC','Mgt Area','Task_completion_time']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
('transformer', transformer),
('model', XGBRegressor(random_state=42))
])
# Parameter grid for Grid Search (a single candidate per hyperparameter)
param_distributions = {
'model__n_estimators': [300],
'model__max_depth': [9],
'model__learning_rate': [0.01],
'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and R² for each set
def calculate_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, R²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, R²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, R²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
    train_errors, val_errors = [], []
    m_values = range(1, min(len(X_train), max_data_points) + 1, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
    plt.figure(figsize=(10, 6))
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.title("Learning Curves", fontweight='bold')
    plt.legend()
    plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.5s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.2s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.1s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.9s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.8s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, R²): (263537.8990894734, 513.3594248569646, 150.21165381420147, 0.23445371062334097)
Testing Metrics (MSE, RMSE, MAE, R²): (196154.40680784325, 442.89322280640425, 146.8169191875315, -0.04663637680816857)
Validation Metrics (MSE, RMSE, MAE, R²): (266086.3908068448, 515.8356238249204, 148.7895905420939, 0.020244019023899718)
# Define predictor variables and response variable
predictors = ['Property Type','ABANDON_REASON_DESC','Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
    ('transformer', transformer),
    ('model', XGBRegressor(random_state=42))
])
# Parameter grid for GridSearchCV
param_distributions = {
    'model__n_estimators': [300],
    'model__max_depth': [9],
    'model__learning_rate': [0.01],
    'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and R² for each set
def calculate_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, R²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, R²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, R²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
    train_errors, val_errors = [], []
    m_values = range(1, min(len(X_train), max_data_points) + 1, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
    plt.figure(figsize=(10, 6))
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.title("Learning Curves", fontweight='bold')
    plt.legend()
    plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.8s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.9s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.1s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 1.3s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, R²): (336926.1944839798, 580.4534386873592, 157.39877929907104, 0.021269430802288825)
Testing Metrics (MSE, RMSE, MAE, R²): (181936.92143585387, 426.5406445297492, 143.18677107939803, 0.02922496946094355)
Validation Metrics (MSE, RMSE, MAE, R²): (266566.1448342174, 516.3004404745529, 148.03962246088153, 0.018477518015372563)
# Define predictor variables and response variable
predictors = ['Mgt Area','Task_completion_time']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
    ('transformer', transformer),
    ('model', XGBRegressor(random_state=42))
])
# Parameter grid for GridSearchCV
param_distributions = {
    'model__n_estimators': [300],
    'model__max_depth': [9],
    'model__learning_rate': [0.01],
    'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and R² for each set
def calculate_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, R²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, R²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, R²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
    train_errors, val_errors = [], []
    m_values = range(1, min(len(X_train), max_data_points) + 1, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
    plt.figure(figsize=(10, 6))
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.title("Learning Curves", fontweight='bold')
    plt.legend()
    plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.7s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.7s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.8s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.7s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.7s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, R²): (303951.9493127701, 551.3183738211253, 166.06555373865794, 0.11705569578744546)
Testing Metrics (MSE, RMSE, MAE, R²): (199365.00642219852, 446.5030866883213, 158.70951518528477, -0.06376742373408528)
Validation Metrics (MSE, RMSE, MAE, R²): (275597.48599611176, 524.9737955327978, 162.26487723616933, -0.014776758886090091)
# Define predictor variables and response variable
predictors = ['Mgt Area']
response = 'Total Value'
# One-hot encode categorical variables
categorical_features = predictors
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
# Split the data
X = int_df_copy_cleaned[predictors]
y = int_df_copy_cleaned[response]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Create a Pipeline with XGBoost model
pipeline = Pipeline([
    ('transformer', transformer),
    ('model', XGBRegressor(random_state=42))
])
# Parameter grid for GridSearchCV
param_distributions = {
    'model__n_estimators': [300],
    'model__max_depth': [9],
    'model__learning_rate': [0.01],
    'model__subsample': [0.9]
}
# Set up K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_distributions, cv=kfold, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(X_train, y_train)
# Best pipeline and parameters
best_pipeline = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
# Predict and evaluate on training, testing, and validation sets
y_train_preds = best_pipeline.predict(X_train)
y_test_preds = best_pipeline.predict(X_test)
y_val_preds = best_pipeline.predict(X_val)
# Calculate MSE, RMSE, MAE, and R² for each set
def calculate_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2
train_metrics = calculate_metrics(y_train, y_train_preds)
test_metrics = calculate_metrics(y_test, y_test_preds)
val_metrics = calculate_metrics(y_val, y_val_preds)
# Print results
print(f"Training Metrics (MSE, RMSE, MAE, R²): {train_metrics}")
print(f"Testing Metrics (MSE, RMSE, MAE, R²): {test_metrics}")
print(f"Validation Metrics (MSE, RMSE, MAE, R²): {val_metrics}")
# Residuals vs Predicted Plot for Validation Set
residuals_val = y_val - y_val_preds
plt.figure(figsize=(8, 6))
plt.scatter(y_val_preds, residuals_val, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
# Actual vs Predicted Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=y_val_preds, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Plot (Validation Set)', fontweight='bold')
plt.show()
def plot_learning_curves(model, X_train, y_train, X_val, y_val, step=50, max_data_points=1000):
    train_errors, val_errors = [], []
    m_values = range(1, min(len(X_train), max_data_points) + 1, step)
    for m in m_values:
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_mse = mean_squared_error(y_train[:m], y_train_predict)
        val_mse = mean_squared_error(y_val, y_val_predict)
        train_errors.append(train_mse)
        val_errors.append(val_mse)
    plt.figure(figsize=(10, 6))
    plt.plot(m_values, np.sqrt(train_errors), label="Train")
    plt.plot(m_values, np.sqrt(val_errors), label="Validation")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE (Root Mean Squared Error)")
    plt.title("Learning Curves", fontweight='bold')
    plt.legend()
    plt.show()
# Using the best_pipeline for learning curve plot
plot_learning_curves(best_pipeline, X_train, y_train, X_val, y_val, step=50, max_data_points=1000)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.3s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.4s
[CV] END model__learning_rate=0.01, model__max_depth=9, model__n_estimators=300, model__subsample=0.9; total time= 0.4s
Best Parameters: {'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 300, 'model__subsample': 0.9}
Training Metrics (MSE, RMSE, MAE, R²): (344242.7170365441, 586.722010015428, 172.0893916539786, 1.5801967088058255e-05)
Testing Metrics (MSE, RMSE, MAE, R²): (187493.71296984018, 433.00544219425257, 158.8834647860307, -0.00042483679350957537)
Validation Metrics (MSE, RMSE, MAE, R²): (271827.26603099983, 521.3705649832946, 163.8061676095854, -0.0008944421347032439)
Purely from an Evaluation Metrics Perspective¶
Model-1 & Model-2 (11 predictors, with and without 'Task Completion Time'):¶
Model-1 (with 'Task Completion Time'):¶
- Shows balanced performance in terms of accuracy and reliability. Insight: incorporating 'Task Completion Time' slightly improves the model's ability to predict costs, suggesting a useful link between the duration of tasks and their cost.
Model-2 (without 'Task Completion Time'):¶
- Comparable to Model-1, with a minor trade-off in predictive power on unseen data. Insight: while the model remains effective, this indicates that 'Task Completion Time' adds a small but valuable dimension to cost prediction.
################################################################################################################
Model-3 & Model-4 (Ensemble mix, with and without 'Task Completion Time'):¶
Model-3 (with 'Task Completion Time'):¶
- Shows a noticeable drop in performance on unseen data. Insight: with fewer predictors, 'Task Completion Time' alone is not sufficient to maintain high prediction accuracy.
Model-4 (without 'Task Completion Time'):¶
- Similar in performance to Model-3, and slightly more reliable on unseen data. Insight: indicates that other predictors, not just 'Task Completion Time', are key in determining repair costs.
################################################################################################################
Model-5 & Model-6 (consensus of 3 feature selection techniques, with and without 'Task Completion Time'):¶
Model-5 (with 'Task Completion Time'):¶
- Struggles with prediction accuracy across all data sets. Insight: a more extensive set of predictors is necessary for accurate cost estimation, beyond just 'Task Completion Time'.
Model-6 (without 'Task Completion Time'):¶
- Similar in performance to Model-5, further emphasising the need for a broader set of predictors. Insight: reinforces the idea that multiple factors must be considered together for effective cost prediction.
################################################################################################################
Model-7 & Model-8 (consensus of all 4 feature selection techniques, with and without 'Task Completion Time'):¶
Model-7 (with 'Task Completion Time'):¶
- Performs poorly in predicting repair costs. Insight: highlights that 'Task Completion Time', especially within such a limited predictor set, cannot reliably predict repair costs.
Model-8 (without 'Task Completion Time'):¶
- Similarly poor performance to Model-7.
################################################################################################################
Business Insights:¶
- Reinforces that relying on very few predictors, even without 'Task Completion Time', does not yield accurate cost predictions.
Some Recommendations:¶
Task Completion Time's influence: while 'Task Completion Time' plays a role in predicting repair costs, its impact is modest and more pronounced when combined with a comprehensive set of predictors.
Predictor synergy: combining 'Task Completion Time' with factors like contractor type, property type and management area provides a more complete picture for predicting repair costs.
Focus on comprehensive models: models with a broader range of predictors tend to be more reliable, so a holistic approach that considers multiple aspects of a repair job is more beneficial for accurate cost estimation.
Driving down repair costs: understanding the relationship between task duration and cost can help optimise operations; shortening completion times where feasible, without compromising quality, might lead to cost savings.
Data-driven decision making: continuously refining the predictive models with updated data, and exploring additional relevant predictors, can enhance the accuracy of repair cost forecasts, aiding better budgeting and resource allocation.
Comparative Modelling Output Analysis (with the same set of predictors and "Total Value" as response) - Ensemble Modelling Techniques (Random Forest and Gradient Boosting)¶
predictors = ['Property Type', 'Jobsourcedescription', 'ABANDON_REASON_DESC', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'Latest Priority Description', 'Mgt Area', 'CONTRACTOR', 'Task_completion_time']
response = 'Total Value'
Performance Comparison Analysis between Ensemble Models from the new-feature ("Task_completion_time") perspective:¶
a) The Gradient Boosting models with 'Task Completion Time' show a trend towards improved performance over the Random Forest models without this feature. This suggests that the time aspect of repairs is an important factor in cost prediction, one the Random Forest models did not fully capture given its statistically significant, albeit weak, linear relationship with cost.
Pattern recognition in model performance:
b) Features like 'Property Type', 'Mgt Area' and 'Contractor' are significant in both model types; their consistent impact across different models reinforces their relevance in predicting repair costs. The improvement seen when 'Task Completion Time' is included in the Gradient Boosting models indicates its relevance, which is not captured in the Random Forest models. This could be a point of consideration for future model enhancements.
Contrastive Performance Evaluation of Ensemble Modelling Algorithms (with new predictor "Task_completion_time")¶
Model Overfitting:¶
Random Forest:¶
Shows signs of overfitting, indicated by a substantial drop in performance from training to testing (e.g., R² from 0.861 to 0.306). This suggests the model may be too complex or too closely fitted to the training data, failing to generalise well to new data.
Gradient Boosting (XGBoost):¶
Exhibits a smaller performance drop from training to testing (R² from 0.825 to 0.383). While there is still a decline, it is less pronounced, suggesting better control of overfitting than Random Forest.
Generalisability:¶
Random Forest:¶
Less generalisable, given the significant performance discrepancy between the training and testing datasets.
Gradient Boosting (XGBoost):¶
Demonstrates better generalisability, maintaining more consistent performance across the different datasets.
Predictability:¶
Random Forest:¶
Has a higher training R² but a lower testing R² and higher testing RMSE than XGBoost, indicating less predictive accuracy on unseen data.
Gradient Boosting (XGBoost):¶
Despite a drop in R² on the testing data, it maintains a higher R² and lower RMSE there than Random Forest, suggesting better predictability on unseen test data.
Observations on ensemble modelling¶
Generic and model-specific factors affecting performance:¶
Random Forest:¶
Robust to outliers and handles non-linear data well, but prone to overfitting if not tuned correctly, especially with a large number of trees.
Gradient Boosting (XGBoost):¶
Iteratively corrects the errors of previous trees, leading to improved accuracy, and handles various data types and distributions well. It is more prone to overfitting than Random Forest but often achieves better performance with proper parameter tuning.
Conclusion:¶
Model overfitting: Gradient Boosting takes a more balanced approach to preventing overfitting than Random Forest.
Generalisability: XGBoost outperforms Random Forest in generalising to unseen data.
Predictive ability: XGBoost demonstrates superior predictive ability, particularly on the testing data, as indicated by its higher R² and lower RMSE values.
Superior performance: XGBoost's iterative error correction and its flexibility with various data types contribute to its superior performance, especially when its parameters are finely tuned to reduce both model bias and variance.
Summary:¶
Good performance: Gradient Boosting's good performance can be attributed to its iterative nature, which allows it to learn complex patterns and interactions between features.
Less optimal performance: Random Forest's weaker performance on unseen data suggests it may not be capturing the complex interactions between 'Task_completion_time' and the other predictors as effectively.
As observed, Random Forest overfits the training data more than Gradient Boosting, due to high collinearity among the predictors and its inability to bring the validation loss down towards the training loss. The Gradient Boosting model also shows better relative performance in the important context of generalisability to unseen test data.
High-Level Data Analysis Observations:¶
Exploratory and Descriptive Analysis¶
Data Distribution, Missingness, Skewness and other analyses¶
Diverse Predictors:¶
The dataset mixes categorical predictors with date-based and numeric predictors (including the feature-engineered task completion time), covering both qualitative and time-based information.
Data Skewness and Missing Data¶
Task completion time is right-skewed, suggesting that most repair tasks are completed within a shorter time frame, while a significant number of tasks take much longer.
Data is missing in some important contexts, such as "Date Comp" (the repair completion date), with 790 records missing, and missing entirely for some categories (e.g. Painting and Decorating, Play and Recreation, and others).
Data is also missing for many categorical variables, as noted in the analysis above.
Data Missingness Analysis and Imputation Challenges in our project context:¶
As observed in the chi-square independence test and other checks, the missingness pattern here does not suggest "MCAR" ("Missing Completely at Random"). Under MCAR it would have been easy to impute with standard techniques, since the data is missing at random and the missingness does not depend on any observed or unobserved data (i.e. the reasons for the missing data are unrelated to the data itself).
Our data falls mainly into the "MAR" or "MNAR" buckets, as confirmed by the test.
In the case of MAR, the cause of the missing data can be explained by other, non-missing variables in the dataset. Missing values can then be imputed by predicting them from the other available data, but the analysis may be biased if these relationships are not properly accounted for.
In the case of MNAR, the missingness depends on unobserved data or on the value of the missing data itself. Imputation is then very challenging, because the missingness cannot be accounted for without additional knowledge, and imputing would introduce bias.
For our modelling, we decided to exclude the records with missing values for task completion time (in days), because of the "Day Comp" missingness.
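The MCAR check described above can be sketched as a contingency table between a missingness indicator and another categorical variable: a significant chi-square statistic suggests the missingness is associated with that variable (pointing to MAR/MNAR rather than MCAR). A minimal illustration with made-up data (column names borrowed from the text, values invented):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 'Date Comp' is missing far more often for one trade category
df = pd.DataFrame({
    "TRADE_DESCRIPTION": ["Painting"] * 50 + ["Plumbing"] * 50,
    "Date Comp": [None] * 40 + ["2021-01-01"] * 10 + [None] * 5 + ["2021-01-01"] * 45,
})
df["comp_missing"] = df["Date Comp"].isna()

# Contingency table: trade category vs. missingness of the completion date
table = pd.crosstab(df["TRADE_DESCRIPTION"], df["comp_missing"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
# A small p-value means missingness is associated with trade category,
# so the data is unlikely to be MCAR.
```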
- Data distribution and other nuanced relationships: repair value and task completion time do not follow a normal distribution; "Repair value" is highly right-skewed with a long right tail, as analysed with various plots, while task completion time is moderately right-skewed.
Univariate Analysis (detailed summary in appendix):¶
In summary, we observed the most frequent count patterns.
Bivariate Analysis (detailed summary in appendix):¶
In summary, we observed various frequent (high-frequency) relationships among different independent predictors.
Timeline trend analysis of repair complaints logged and solved: this is to understand the sequence of logged and solved complaints over years, months and days with various interactive line plots.
Descriptive Analysis¶
To understand and validate the earlier observed patterns of data distribution and skewness with statistical measures and plots (box plots, Q-Q plots etc.).
Clustering Analysis (on a very limited scale)¶
- Repair costs based on Job Status and its Initial Priority:
We carried this out on a limited scale because this is high-dimensional categorical data, where distance measures become less meaningful unless the data is reduced to a few categories (the "curse of dimensionality" problem). As a workaround, we carried out many multi-level bivariate analyses, albeit as count plots over two variables.
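The limited DBSCAN pass mentioned above can be sketched on toy one-dimensional cost data; the `eps` and `min_samples` values here are illustrative, not the notebook's actual settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy repair-cost values drawn from two (job status, priority) groups
costs = np.concatenate([rng.normal(100, 10, 50),
                        rng.normal(500, 20, 50)]).reshape(-1, 1)
scaled = StandardScaler().fit_transform(costs)  # DBSCAN is distance-based, so scale first

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)
print(set(labels))  # cluster ids found; -1 would mark noise points
```

On real multi-level categorical data, the groups would first need encoding, which is where the dimensionality problem described above arises.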
Feature Engineering:¶
Creating a new predictor ("Task_Completion_Time") from existing predictors ("Day Logged", "Day Comp") to understand its impact on housing repair costs.
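That derivation amounts to a date difference in days; a minimal sketch, assuming the "Day Logged"/"Day Comp" column names from the text (the dates themselves are made up):

```python
import pandas as pd

# Toy records; in the actual data 'Day Logged' and 'Day Comp' hold the raw dates
df = pd.DataFrame({
    "Day Logged": ["2021-03-01", "2021-03-10", "2021-04-02"],
    "Day Comp": ["2021-03-05", "2021-03-25", "2021-04-30"],
})
df["Day Logged"] = pd.to_datetime(df["Day Logged"])
df["Day Comp"] = pd.to_datetime(df["Day Comp"])

# Engineered predictor: elapsed days between logging and completing a repair
df["Task_completion_time"] = (df["Day Comp"] - df["Day Logged"]).dt.days
print(df["Task_completion_time"].tolist())  # [4, 15, 28]
```

Rows where "Day Comp" is missing yield a missing completion time, which is why those records were dropped before modelling.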
Analyses Specific to Modelling Assumptions Prior to the Modelling Exercise:¶
Linear Regression Assumptions¶
1- Data collinearity analysis techniques:
a) Chi-square test of independence: using categorical variable association (contingency table analysis) to determine whether there is a significant association between two categorical variables. It showed that 52 out of 53 variable pairs have significant associations.
b) Variance inflation factor (VIF): converting categorical predictors into numerical values (e.g. through dummy coding) to assess collinearity among predictors.
2- Data distribution (linearity check) of the target repair cost across various categorical variables.
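The VIF check in (b) can be sketched with statsmodels on numeric (or dummy-coded) predictors; values well above roughly 5-10 flag the kind of collinearity reported in the findings below. The data here is synthetic and the threshold is a rule of thumb:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                          # independent predictor
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

# VIF for each predictor (the constant column is needed but not reported)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # x1 and x2 show large VIFs, x3 stays close to 1
```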
Findings¶
Non-linear distribution of (categorical predictor, repair cost) relationships for a majority of variable pairs, due to the repair cost skewness found earlier.
High level of collinearity among predictors.
L2 Regularization Attempt (Ridge Regression)ΒΆ
This attempt was to mitigate the high collinearity among predictor variables and make the data suitable for modelling.
Findings:ΒΆ
The encoding required for Ridge regression created a huge number of dummy variables (due to the original multi-level categorical data), which made the collinearity problem worse.
After L2 regularization, the VIF test still showed collinearity among predictors. We did not attempt L1 (Lasso) regression, as its inherent nature would have zeroed out the contribution of many predictors in the model.
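The dummy-coding-plus-Ridge setup can be sketched with scikit-learn. This is a toy example on synthetic data; the column names are borrowed from the predictor list later in this notebook, and the gamma-distributed target stands in for the right-skewed repair cost:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two multi-level categorical predictors and a cost.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Property Type": rng.choice(["Terrace", "End Terrace", "Flat"], size=200),
    "TRADE_DESCRIPTION": rng.choice(["Gas", "Carpentry", "Plumbing"], size=200),
})
y = rng.gamma(shape=2.0, scale=50.0, size=200)  # right-skewed, like repair cost

# Dummy-code the categoricals, then fit an L2-penalised linear model.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), list(X.columns))])),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(X, y)
print(model.predict(X.head(3)).shape)  # (3,)
```

Note how even two three-level categoricals already expand to six dummy columns; with the dataset's many multi-level predictors this expansion is what aggravated the collinearity.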
Post modelling diagnostics :ΒΆ
- Residual plot - Showed heteroscedasticity (non-constant variance), signifying a pattern among residuals and the model's inability to explain them completely.
- Autocorrelation plot (ACF) - Though this is more relevant for time-series analysis, as a confirmation it showed no pattern/trend among residuals.
- Q-Q plot of residuals - Showed non-normality of the data.
- Regression plot (actual vs. predicted values with best-fit line) - Showed predictions completely off the mark, with a nearly flat, straight-line-like curve.
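A quick numeric screen for the heteroscedasticity described above: if the magnitude of the residuals correlates with the fitted values, the constant-variance assumption is suspect. A minimal sketch on simulated residuals (not the notebook's actual fit):

```python
import numpy as np

# Simulated fitted values and residuals whose spread grows with the fit
# level, producing the "funnel" shape seen in the residual plot.
rng = np.random.default_rng(1)
fitted = np.linspace(1.0, 100.0, 500)
resid = rng.normal(scale=0.1 * fitted)

# Heteroscedasticity screen: correlation of |residuals| with fitted values.
corr = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(corr > 0.3)  # True -> noticeable funnel shape, i.e. heteroscedastic
```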
Conclusion (Linear and Ridge Regression):ΒΆ
All these findings made linear/L2 regression unsuitable for our modelling.
Alternative Modelling StrategiesΒΆ
Two Ensemble Modelling Techniques (Random Forest and Gradient Boosting-XGB)ΒΆ
A detailed comparative analysis has been done.
Model performance and comparative post-hoc diagnostics have been carried out in a multi-step approach:
Different sets of important predictors, determined through 4 feature selection techniques, have been used in different modelling stages. Selection is based on majority consensus among the 4 techniques, at a chosen consensus threshold.
Modelling involves hyperparameter tuning to determine optimal hyperparameters.
Post-hoc diagnosis of results through evaluation metrics such as MSE, RMSE, R2, and MAE.
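The evaluation metrics named above can be computed as follows (hypothetical actual vs. predicted repair costs, shown only to illustrate the metric calls):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted repair costs.
y_true = np.array([120.0, 80.0, 310.0, 45.0, 210.0])
y_pred = np.array([130.0, 75.0, 290.0, 60.0, 200.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # robust to large outliers
r2 = r2_score(y_true, y_pred)              # variance explained
print(f"MSE={mse:.1f} RMSE={rmse:.1f} MAE={mae:.1f} R2={r2:.3f}")
```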
Feature Importance Selection Techniques Employed:ΒΆ
- Random Forest Regressor
- ANOVA F-Test
- Permutation Feature Importance (PFI)
- Recursive Feature Elimination (RFE)
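The four techniques and the majority-consensus vote can be sketched together on synthetic regression data (the threshold of 3-of-4 is illustrative; the notebook's interactive cell lets the user choose 1 through 4):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=0)
k = 3  # how many features each technique nominates
votes = np.zeros(X.shape[1], dtype=int)

# 1- Random Forest impurity importance
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
votes[np.argsort(rf.feature_importances_)[-k:]] += 1

# 2- ANOVA F-test
votes[SelectKBest(f_regression, k=k).fit(X, y).get_support()] += 1

# 3- Permutation Feature Importance
pfi = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
votes[np.argsort(pfi.importances_mean)[-k:]] += 1

# 4- Recursive Feature Elimination
votes[RFE(LinearRegression(), n_features_to_select=k).fit(X, y).support_] += 1

threshold = 3  # keep features nominated by at least 3 of the 4 techniques
selected = np.flatnonzero(votes >= threshold)
print(selected)
```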
Modelling Evaluation Strategies:ΒΆ
First Strategy - Without the inclusion of new feature variable ("Task_Completion_Time")ΒΆ
How the modelling evaluation has been done to predict "Total Repair Cost" response variable
- Random Forest has been executed on predictor sets determined through consensus among 1, 2, 3, and 4 feature selection techniques, i.e. executed 4 times.
- Similarly, Gradient Boosting (XGBoost) has been executed 4 times.
Second Strategy - With the inclusion of new feature variable ("Task_Completion_Time")ΒΆ
How the modelling evaluation has been done to predict "Total Repair Cost" response variable
- Random Forest has been executed on predictor sets determined through consensus among 1, 2, 3, and 4 feature selection techniques, i.e. executed 4 times.
- Similarly, Gradient Boosting (XGBoost) has been executed 4 times.
Third Strategy - Comparing the performance between the models from Strategies 1 and 2ΒΆ
This is to understand the impact of the new predictor ("Task_Completion_Time") and its importance in predicting total repair cost, in contrast to Strategy 1 (without "Task_Completion_Time").
Modelling Evaluation Diagnosis, Insights and RecommendationΒΆ
Notes- All are explained in-line in the notebook.
#######################################################################################################################
Alternate ModellingΒΆ
Time Series Analysis and ForecastingΒΆ
Notes-
- This is an extra attempt to understand trend and seasonality in the possibly temporal nature of the data (using "date logged") and to forecast repair cost over time.
- A data stationarity check and differencing techniques (standard and seasonal differencing) have been tried, though without any relevant outcomes for repair cost forecasting.
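The standard and seasonal differencing steps can be sketched with pandas on a hypothetical 18-month series (the data window mentioned in the limitations that follow); note how each differencing pass costs observations:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly repair-cost series over the ~18-month window.
idx = pd.date_range("2022-07-01", periods=18, freq="MS")
rng = np.random.default_rng(0)
cost = pd.Series(rng.gamma(2.0, 50.0, size=18).cumsum(), index=idx)  # trending

# Standard (lag-1) and seasonal (lag-12) differencing, as tried here.
diff1 = cost.diff().dropna()
diff12 = cost.diff(12).dropna()

# Each pass loses data points -- a real constraint with 18 observations.
print(len(cost), len(diff1), len(diff12))  # 18 17 6
```

A formal stationarity check (e.g. an augmented Dickey-Fuller test) would then be run on each differenced series.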
Limitations of Time Series Analysis (in this context):ΒΆ
First of all, we have limited data (18 months).
As observed here, data based on "Day Logged" and "Comp Date" shows no trend or seasonality in existing repair costs.
Initial data non-stationarity, due to the high right skewness observed in repair cost (e.g. changing variance or shifting means), violates time-series assumptions.
As tried, different orders of differencing do not help here, and each differencing pass further reduces the number of data points.
Multivariate complexity, with different predictors influencing repair cost.
Appendix:ΒΆ
###########################################################################################################################
Univariate Analysis (Numerical Count and % data):ΒΆ
JOB_TYPE_DESCRIPTION:ΒΆ
Shows "Responsive Repairs" (63.29%) is the largest category, followed by "Gas Responsive Repairs" (19.63%).
Contractor:ΒΆ
A large, disproportionate share (approx. 90%) of contractors are unnamed, suggesting either anonymisation or data entry errors.
Property Type:ΒΆ
Property types ("Terrace", "End Terrace", and "Access Direct") account for the bulk of housing repair requests, at approximately 34.58%, 25.3%, and 15.66% respectively.
Jobsourcedescription -ΒΆ
A high volume of repair request calls originate from "CSC Phone call" (=60.81%) followed by via website (=15.75%).
Initial Priority Description:ΒΆ
Though not very pronounced, a majority of initial request priorities are "Appointable" (=39.36%), followed by blank/missing priority descriptions (=18.71%) and Emergency requests (=16.77%).
Latest Priority Description:ΒΆ
In contrast to the initial priorities of repair requests, the latest priorities shift while keeping a similar ordering: a majority are "Appointable" (=53.2%), followed by Emergency requests (=17.11%) and Urgent PFI Evolve RD Irvine EMB (=10.78%), with very few missing latest priority descriptions (=199 records).
Job Status Description:ΒΆ
A very significant share of repair statuses have been updated to "Invoice Accepted" (=75.57%), followed by "Abandoned" (=18.95%).
Trade Description:ΒΆ
A moderate but leading share of repair requests belong to "Gas Repairs" (=20.87%), followed by Carpentry (=18.31%) and Plumbing (=15.48%).
Abandon Reason Description:ΒΆ
Similarly, a moderate but leading share of repairs have been abandoned for reasons "No Work Required" (=21.77%), followed by "Alternative Job" (=20.46%), "No Access" (=16.86%), and "Duplicate Order" (=10.04%).
###########################################################################################################################################
Some Key findings of the above feature predictors:ΒΆ
Types of Repairs:ΒΆ
The dataset includes a wide variety of repair types. The bar plot indicates that some types of repairs are much more common than others. "Gas Responsive Repairs" seems to be the most frequently occurring type, suggesting a high demand for gas-related repair services or possibly a recurring issue with gas systems.
Contractors:ΒΆ
There is a significant variation in the number of jobs handled by different contractors. Some contractors ("N/A"s) have a much higher count, indicating that a few contractors are handling a large volume of the work. This could be due to the size of the contractor companies, their specialization, or contractual agreements.
Year of Build:ΒΆ
Year of build shows that a large portion of the properties were built in a certain period, which appears to be around the 1960s and 1970s. There's a noticeable decline in buildings from newer years, suggesting that either newer properties require fewer repairs or the dataset covers a demographic with older properties.
Property Types:ΒΆ
The property-types plot reveals that certain property types, like "Terrace" and "Semi Detached," are more common in the dataset. This may reflect the housing stock in the area or indicate that these property types are more prone to repairs.
Priority of Jobs:ΒΆ
The majority of repair jobs are categorized with lower urgency, as seen by the tall bars for categories like "Urgent Call Out" and "Appointments." High urgency or emergency jobs are less frequent, which could indicate an efficient maintenance schedule or fewer emergency issues.
Average Cost per Job Type:ΒΆ
There is a noticeable variation in the average cost per job type. Some job types, like "Fire Risk Assessment" and "Warden Call Equipment," show a higher average cost, which may be due to the complexity or the specialized skills required for these repairs.
Repair Complaints Logged and Solved Dates:ΒΆ
A significant increase in the number of repair complaints from 2022 to 2023: 6,780 in 2022 vs. 14,506 in 2023, an increase of roughly 114% (equivalently, the 2022 volume was 53.26% below the 2023 level).
Pattern of Repair Log Requests:ΒΆ
Also, there is a steady increase in complaints from mid-year (June) to November, with a sharp drop in December, which could be cyclical and which we would investigate.
Broadly, the lack of a clear seasonal trend suggests that the factors affecting the number of repair complaints are more complex and could be related to specific events or changes in operation, operational management, or contractor factors rather than the time of the year.
Job Type vs. Initial Priority DescriptionΒΆ
Cross-tabulate the types of repair jobs with their initial priority descriptions to understand the distribution of priority levels for each job type.
1- We can see that job types ("Responsive Repairs" and "Gas Responsive Repairs") of "Appointable" priority dominate, followed by the same job types at "Emergency" priority.
2- This is followed by "Communal Responsive Repairs" under "Appointable" priority, though in reduced numbers (=147).
################################################################################################################################
Job Type vs. Latest Priority Description:ΒΆ
Cross-tabulate the types of repair jobs with their latest priority descriptions to understand the distribution of priority levels for each job type.
1- The latest priority of these job types stays the same as the initial priority, though the numbers tend to increase over the time span, as expected, as new jobs get added.
################################################################################################################################
Property Type vs. Initial Priority Description:ΒΆ
Examine how the initial priority of repairs varies across different property types.
1- We can see that the "Terrace", "End Terrace", and "Access Direct" properties with "Appointable" and "Emergency" repair priorities dominate.
2- We also see that for many of these property types the priority has not been updated, as the values are missing in the data.
################################################################################################################################
Property Type vs. Trade Description:ΒΆ
Investigate if there are certain linkages between certain types of properties and the repair task that needs to be carried out.
1- "Terrace", "End Terrace", "Access Direct", and "Access via internal shared area" properties are more prone to repair tasks such as "Gas Repairs", "Carpentry", "Plumbing", and "Electric Repairs".
2- In general, "Terrace" and "End Terrace" properties have the most repair requests.
###############################################################################################################################
Property Type vs. Abandon Reason Description:ΒΆ
Investigate if there are any type of properties that are more prone to abandonment.
1- Primarily the "Terrace", "End Terrace", "Access Direct", and "Access via internal shared area" property types are being abandoned for repair service, in that order.
2- Though "Access via internal shared area" and "Semi Detached" properties are abandoned on a much lower scale compared to the other mentioned property types.
3- The primary reasons for abandoning the service are "No Work Required", "Alternative Job", "No Access", "Duplicate Order", and "Tenant Missed Apt".
4- "Tenant Refusal" and "Input Error" are other reasons for task abandonment, though on a very low scale.
5- Interestingly, we see here that "Access via internal shared area" jobs are being abandoned due to wrong contractor assignment (=94).
################################################################################################################################
Property Type vs. Job Status Description:ΒΆ
Analyze the relationship between the property type and the status of repair jobs.
1- We can see that "Invoice Accepted" is the dominant job status for the ("Terrace", "End Terrace", "Access Direct", and "Access via internal shared area") property types.
2- It seems this is predominantly due to the fact that these properties have high volume of repair requests.
3- These 4 types of properties are also abandoned the most for repair service.
################################################################################################################################
Mgt Area vs. Abandon Reason Description:ΒΆ
Explore the distribution of abandon reasons across different management areas, and see whether work abandonments are concentrated in any specific management area dealing with contractors, or whether there is no such pattern.
1- We can see that "MA1" is an outlier, in that most properties under it are being abandoned by contractors for the reasons below.
2- Most importantly, Mgt Area "MA1" is abandoning property repair service requests primarily for the reasons "No work required", "Alternative Job", "No Access", "Duplicate Order", and "Tenant Missed Apt".
3- We need to prioritise focus on management area "MA1" to understand the reasons for abandonment and improve the service level.
4- We also need to understand why a disproportionately higher number of requests are being routed through "MA1". This will allow for potential optimal allocation of resources, skilled resource augmentation, etc.
###############################################################################################################################
Mgt Area vs. Trade Description:ΒΆ
Explore the distribution of repair trades across different management areas.
1- We know from earlier that repair trades ("Gas Repairs", "Carpentry", "Plumbing", and "Electric Repairs") are the dominant trades, abandoned the most in "Terrace" and "End Terrace" properties.
2- We see here that Mgt Area "MA1" is more heavily engaged with these skilled trades than other management areas.
3- This calls for more analysis to understand whether MA1 is overloaded with bulk requests and whether more workload balancing is required, in the form of optimal resource allocation or skill augmentation with more skilled contractors.
#########################################################################################################################
Trade Description vs. Abandon Reason Description:ΒΆ
Examine the reasons for abandoning repair jobs within specific trade categories.
1- "Gas Repair", "Carpentry", "Plumbing", and "Electric Repairs" jobs are being abandoned predominantly, in that order.
2- The primary reasons are "Alternative Job", "No Access", "No Work Required", and "Duplicate Order"; "Tenant Missed Apt" and "Tenant Refusal" are other reasons but on a much lower scale in comparison.
#########################################################################################################################
Initial Priority Description vs. Job Status Description:ΒΆ
Explore the relationship between the initial priority of repairs and their current status.
1- Repair tasks logged with "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB" priorities have the most "Invoice Accepted" statuses, with "Appointable" leading the list.
2- Understandably, these priority tasks ("Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB") also lead the list of abandoned tasks (1645, 502, and 457 tasks abandoned respectively).
3- Notably, though these priority types occur in large numbers, they seem to remain unresolved, as reflected in the "Work Completed" job status with negligible completion counts (41, 29, and 18 respectively).
#########################################################################################################################
Latest Priority Description vs. Job Status Description:ΒΆ
Explore the relationship between the latest priority of repairs and their current status.
1- Repair tasks with final priorities "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB" have the most "Invoice Accepted" statuses, with "Appointable" leading the list.
2- Understandably, these priority tasks ("Appointable", "Urgent PFI Evolve RD Irvine EMB", and "Emergency") also lead the list of abandoned tasks (2331, 543, and 533 tasks abandoned respectively).
3- Notably, though the "Appointable" and "Emergency" priority types occur in large numbers, they seem to remain unresolved, as reflected in the "Work Completed" job status with negligible completion counts (61 and 46 respectively).
4- We can notice a gradual increase in the number of jobs registered as "Appointable", "Emergency", and "Urgent PFI Evolve RD Irvine EMB", possibly due to the accumulation of incomplete jobs and the addition of new requests.
#########################################################################################################################
Property Type vs. Contractor:ΒΆ
1- Although 30 contractors in total are requisitioned for various repair jobs, only a very few (=4: contractors 27, 16, 5, and 29) are predominantly tasked with "Terrace" property types, which have the most repair requests.
2- Similarly, very few contractors (=7) are being utilised for "End Terrace" property types.
3- Most of the contractors are deployed for only a few repair tasks (e.g., contractors 13, 2, 29, and 21, among others).
4- This could be because many contractors lack the necessary expertise for certain job roles, such as gas repairs, carpentry, plumbing, and electric repairs, or because they are being underutilised. This needs further investigation.
5- This calls for either augmenting the contractor resource pool with suitably skilled agencies, optimising the allocation of existing resources in cases of underutilisation, or enhancing existing contractors' skills to serve "Terrace" properties. This also needs further investigation.
6- Interestingly, almost all service requests for "Terrace", "End Terrace", "Semi Detached", "Access Direct", and "Access via internal shared area" properties are being routed through a contractor without any ID ("N/A").
7- This seems potentially suspicious; it could be a data entry error or anonymisation, and needs to be investigated.
#########################################################################################################################
Job Type vs. Job Status:ΒΆ
Analyze how different job types correspond to different job statuses. This can help in understanding the completion status of different types of repair jobs.
1- As we see, "Responsive Repairs" and "Gas Responsive Repairs" are the dominant job types with a large number of "Invoice Accepted" statuses; they are also among the most abandoned jobs, with fewer completions in comparison to their "Invoice Accepted" counts.
#########################################################################################################################
Contractor vs. Abandon Reason Description:ΒΆ
Investigate to know any pattern of abandon reasons given by contractors . This can highlight areas where certain contractors may need additional support or training.
1- We can see that almost 10 contractors (out of 30 in total) have abandoned properties due to inaccessibility, which has resulted in a pile-up of request backlogs with a large number of incomplete jobs.
2- This narrows the scope to checking why "Terrace" and "End Terrace" properties are inaccessible, whether due to a lack of contractor skills or other potential reasons.
3- As confirmed earlier, the bulk of abandonments are by the "N/A" contractor engaged in the skilled trades, as per the descriptions found earlier.
#########################################################################################################################
Contractor vs. Job Type Description:ΒΆ
Investigate the pattern of job types handled by each contractor. This can highlight areas where certain contractors may need additional support or training.
1- As we know from the univariate analysis, the predominant job types are "Responsive Repairs" and "Gas Responsive Repairs", but we can see that only a couple of contractors are tasked with "Responsive Repairs", with no contractors assigned to handle "Gas Responsive Repairs".
2- This finding potentially highlights the need for skill augmentation of the existing contractor pool, through either training or supplementing the pool with skilled staff.
##########################################################################################################################
Service Type Description vs. Job Status:ΒΆ
Investigate whether any specific area is more prone to work abandonment. This will allow us to focus more on that area by diverting an optimal number of resources.
##########################################################################################################################
Day of Date Logged vs. ABANDON_REASON_CODE:ΒΆ
Investigate patterns between the day a repair job was logged and the reasons for abandonment. This can help in identifying any temporal trends related to abandoned repair jobs.
##########################################################################################################################
Job Type vs. Job Status:ΒΆ
Analyze how different job types correspond to different job statuses. This can help in understanding the completion status of different types of repair jobs.
##########################################################################################################################
Property Type vs. Initial Priority Description:ΒΆ
Explore the relationship between property types and the initial priority assigned to repair jobs. This can provide insights into the urgency of repairs for different property types.
Appendix- Clustering Analysis-ΒΆ
Clustering - Repair costs based on Job Status and its Initial PriorityΒΆ
It suggests a clear categorization of repair jobs based on cost, with each type of job status and priority having distinct cost characteristics. Emergency repairs seem to have more predictable costs, whereas non-emergency repairs show more variability.
Clusters 0, 1, and 2:ΒΆ
Here, the clusters are spread more widely on the log-transformed cost axis, which implies a greater variation in the cost of responsive repairs.
Specifically:
Cluster-0: Represents appointable responsive repairs with a wide range of costs.
Cluster-1: Corresponds to emergency responsive repairs, also showing a wide range of costs but slightly less variation than Cluster-0.
Cluster-2: Relates to a specific type of urgent responsive repairs and shows a similar spread to Cluster-0, which suggests variability in costs as well.
Dominance of clusters based on spread and placement: the placement of Clusters 0, 1, and 2 towards the center and right indicates that these types of repairs are more variable and potentially more costly. In contrast, Clusters 3 and 4 show less variability and lower costs, suggesting that gas repairs, whether emergency or not, are less varied and perhaps subject to more standardized pricing.
Key observations are the distinct groupings of repair costs by job type and priority, with gas repairs standing out for their lower and less variable costs. This could influence how resources are allocated, how pricing models are developed, and how services are scheduled. The smaller number of data points in Clusters 3 and 4 might also suggest a need for focused analysis on gas repairs to understand their cost structure and frequency better.
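The clustering setup can be sketched with scikit-learn's DBSCAN. This is a toy illustration on synthetic data: the log-transformed cost follows the notebook's approach, and the two simulated groups mimic the tight low-cost (gas-like) and wide high-cost (responsive-like) clusters described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: log-transformed repair costs plus one label-encoded
# categorical (priority), mimicking the notebook's 2-feature setup.
rng = np.random.default_rng(0)
log_cost = np.concatenate([
    rng.normal(3.0, 0.2, 100),   # tight, cheap cluster (gas-like repairs)
    rng.normal(6.0, 0.8, 100),   # wide, costly cluster (responsive repairs)
])
priority = np.concatenate([np.zeros(100), np.ones(100)])

X = StandardScaler().fit_transform(np.column_stack([log_cost, priority]))
labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)
print(sorted(set(labels) - {-1}))  # cluster ids; -1 marks noise points
```

Scaling matters here: without it, the log-cost axis dominates the Euclidean distances DBSCAN uses.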
Potential Features Impacting House Repair Cost :ΒΆ
( As observed from different feature selections and model performance behaviour):ΒΆ
predictors = ['JOB_TYPE_DESCRIPTION', 'CONTRACTOR', 'Property Type', 'Jobsourcedescription', 'Latest Priority Description', 'JOB_STATUS_DESCRIPTION', 'TRADE_DESCRIPTION', 'ABANDON_REASON_DESC', 'Mgt Area', 'Task_Completion_Time']
response = 'Total Value'
Predictor Significance: The analysis suggests that while 'Task Completion Time' is a relevant factor, other predictors like 'Property Type', 'Management Area', and 'Contractor', 'Trade Description' play a more pivotal role in determining repair costs.
1. Inclusion of 'Task Completion Time':ΒΆ
The addition of 'Task Completion Time' shows a nuanced impact. In some models, it slightly improves the model's ability to generalize to unseen data, while in others, it does not have a significant effect.
This predictor ('Task Completion Time') needs to be investigated further in association with other features to understand its impact.
2. Other predictors influencing the model behaviour:ΒΆ
###########################################################################################################################
Note- As we know, there is non-linearity observed between different predictors and repair value, as seen earlier in our initial scatter plots.
As observed from EDA:ΒΆ
- JOB_TYPE_DESCRIPTION - Certain repairs like "Void Repairs", "Responsive Repairs", and "Suspected Damp" seem to have large clusters of data points, indicating the volume of tasks undertaken.
- CONTRACTOR - "N/A" category contractors dominate in carrying out the most repair tasks.
- Mgt. Area - The "MA1" management area dominates, supervising a very large number of repair tasks.
- Jobsourcedescription - Primarily "Total Mobile App", "One Mobile App", and "CSC Phone call" are the origin of a large number of repair complaint logs.
- Property Type - Property types "Terrace", "Access Direct", and "End Terrace" have the most repair requests.
- Initial Priority Description - The initial complaint priority "Two week void" dominates the list of all initial priorities.
- Final Priority Description - Similarly, the final complaint priority "Two week void" dominates the list of all final priorities.
- TRADE_Description - The "Carpentry" trade dominates a large number of repairs.
- ABANDON_REASON_DESC - Interestingly, all the valid abandon reason codes have total repair values of '0', while non-assigned abandon reason codes (blank/null) carry all the repair values in the original data, which the scatter plot reflects as a straight line at 0.
- JOB_STATUS_DESCRIPTION - "Job Logged", "Invoice Accepted", and "Work Completed" dominate the list, in that order, for which most repair tasks have been carried out.
#########################################################################################################################
Generic Notes About Predictors:ΒΆ
JOB_TYPE_DESCRIPTION: Describes the type of repair job. Different types of jobs can have varying costs due to factors like complexity, materials required, and skill level needed.
CONTRACTOR: This field might indicate the contractor or company responsible for the repair. Costs can vary based on the contractor's pricing, efficiency, and quality of work.
Mgt Area: The management area could denote different geographical regions, each with its own labor and material cost variations.
Property Type: Different types of properties, such as semi-detached, detached, or apartments, can have different repair needs and associated costs.
Jobsourcedescription: The source of the job could indicate the context of the repair (e.g., routine maintenance versus emergency repair), which can influence cost.
Latest Priority Description: This may reflect the urgency or importance of the job, potentially affecting how quickly and expensively it needs to be addressed.
JOB_STATUS_DESCRIPTION: The current status of the job could influence costs, especially if it reflects ongoing changes or issues in the repair process.
TRADE_DESCRIPTION: Different trades (e.g., plumbing, electrical, carpentry) can have varying standard costs associated with them due to differences in labor and material requirements.
ABANDON_REASON_DESC: If a job is abandoned, the reason might be related to cost (e.g., budget constraints), which could be an important predictor for the total repair cost.
#########################################################################################################################
"Task Completion Time" as a New Target VariableΒΆ
Note- Feature Engineered from variables ("Date Logged" and "Date Comp") in our analysis and modelling
Advantages :ΒΆ
- Operational Efficiency:
Predicting Duration: Understanding how long different types of repairs take can help in scheduling and resource allocation.
Workflow Optimization: Identifying factors that lead to delays can improve overall operational efficiency.
Resource Allocation: Staffing and Scheduling: Predicting the time required for tasks can aid in better staffing and scheduling of repair crews.
Contractor Efficiency: Analyzing contractor performance in terms of completion time as well as cost.
Customer Satisfaction and Expectation Management: Being able to predict and communicate accurate completion times can enhance tenant satisfaction.
Service Quality: Faster completion times might correlate with better service quality or vice versa.
######################################################################################################
Potential future Data Enhancements/Augmentation (for Predictors and Response) for more accurate prediction of repair value or other responses.ΒΆ
Detailed Repair Breakdown:ΒΆ
Instead of a single 'Total Value', itemizing the costs into labor, parts, and additional expenses could provide deeper insights and allow for more granular analysis and cost optimization.
Quality of Repairs:ΒΆ
Collecting follow-up data on the quality and durability of the repairs could help in assessing the performance of contractors/managements and predicting future repair costs.
Demographics:ΒΆ
Incorporating demographic information, such as the location's socioeconomic status or urban versus rural setting, could reveal trends related to repair costs across different regions.
Frequency of Repairs:ΒΆ
Tracking how often repairs are required for the same issue or property could indicate underlying problems that are not apparent from a single repair. Predicting the time until the next repair could provide a broader perspective on service quality.
Regulatory or Policy Changes:ΒΆ
Any changes in building codes, safety regulations, or environmental standards could affect repair costs over time and should be recorded.
Market Price Trends:ΒΆ
Including data on the cost of materials and labor market rates over time would help adjust for economic fluctuations and inflationary costs that might impact repair costs.
More Historical Data:ΒΆ
Extending the dataset beyond 18 months could reveal long-term trends and seasonality aspects that could allow for advanced time series analysis to forecast repair costs based on patterns and budget accordingly.
Contractor and Management Performance:ΒΆ
Metrics on the performance of managements and contractors could be useful in the prediction models to correlate between repair costs and competancy skills that would allow for proper resource planning.
Tenant Feedback:ΒΆ
In cases of rental properties, tenant feedback on repairs could provide additional context on the urgency and necessity of repair which would potentially improve repair costs in the form of reduced abandonments and other associated factors.
Customer satisfaction Index:ΒΆ
Measuring customer satisfaction or sentiment over time and integrating that into features could potentially improve repair costs by measuring the management and contractor's performance.
Further Recommended Analysis of the Predictor "Task Completion Time"
Direct Cost Implications:
In housing repairs, the time taken to complete a task is often directly proportional to the cost. Longer tasks might require more labor hours, extended use of equipment, or prolonged disruptions that can increase the overall cost. Including "Task_completion_time" allows the model to factor in these potential cost escalations.
Complexity and Urgency Indicator:
Longer completion times might indicate more complex or urgent repairs, which are typically more expensive. This feature can serve as a proxy for the complexity and urgency of a job, both of which are critical factors in cost estimation.
Resource Allocation and Efficiency:
The time taken to complete a task can also reflect the efficiency and resource allocation of the contractor. Efficient contractors who complete tasks more quickly might incur lower costs due to optimized resource usage.
Predictive Power in Different Models (as observed in our models):
The different ways in which XGBoost and Random Forest handle features can explain why "Task_completion_time" improves performance in one but not the other. XGBoost is particularly effective at capturing complex, non-linear relationships and interactions between features. Therefore, it might be better at utilizing "Task_completion_time" in conjunction with other features to predict repair costs accurately. On the other hand, Random Forest, while powerful, might not leverage this specific feature as effectively, especially if it does not significantly contribute to reducing variance in the predictions.
In summary, "Task_completion_time" likely adds important information regarding the cost drivers in housing repairs, such as labor intensity, complexity, and urgency. Its impact on model performance can vary based on the modeling technique used, with XGBoost potentially being more adept at exploiting this feature's predictive value.
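The effect described above can be checked with a simple ablation: fit each model with and without the feature and compare cross-validated scores. A minimal sketch on synthetic data (the column name `Task_completion_time` mirrors this notebook; scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the sketch needs only scikit-learn, and the data-generating process here is an assumption, not the real repairs data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the repairs data: cost depends non-linearly on completion time
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Task_completion_time": rng.uniform(1, 60, n),   # hours, hypothetical scale
    "Priority_code": rng.integers(0, 4, n),          # hypothetical encoded priority
})
y = 100 + 15 * X["Task_completion_time"] ** 1.2 + 50 * X["Priority_code"] + rng.normal(0, 50, n)

# Compare cross-validated R^2 with and without the feature, per model
for model in (RandomForestRegressor(random_state=0), GradientBoostingRegressor(random_state=0)):
    full = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    ablated = cross_val_score(model, X.drop(columns="Task_completion_time"), y, cv=5, scoring="r2").mean()
    print(type(model).__name__, "R2 with feature:", round(full, 3), "| without:", round(ablated, 3))
```

On the real data, the same loop (swapping in the fitted XGBoost and Random Forest pipelines) would quantify how much of each model's performance is attributable to this one predictor.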
Recommendations:
Please note, this is supplementary to the enhanced scope of data augmentation mentioned above.
Optimizing Contractor Selection Based on Performance (Performance Tracker):
Segregating the skilled and unskilled workforce and tracking their improvement over time would allow us to match the supply-demand ratio, achieve more optimal allocation, and reduce potential abandonment rates, redundant repair jobs, etc.
Preventive Maintenance:
Identifying trends in repair types could lead to preventive measures, reducing long-term costs.
Budget Allocation Based on Material and Contractor Costs at a Granular Level:
Keeping track of material costs, considering inflationary pressures and economic trends, would allow better planning of the overall budget.
Property Maintenance Tracker
Insights into which properties or areas incur higher costs can help in better budget planning.
Management and Contractor Workload Optimization and Allocation:
This would allow us to allocate resources more effectively, potentially improving abandonment rates and customer satisfaction, and thus reducing redundant, avoidable or missed repairs.
Continuous Market Analysis:
Continuously analyze the housing market for trends that could impact the business.
Customer/Tenant Satisfaction
This is the overarching metric and the ultimate goal of everything discussed above. Although it is not directly quantifiable in the way repair cost or repair timelines are, it is an extremely important metric that ties together the other factors driving up repair costs.
###############################################################################################################################################
Some of the factors we have observed in the predictor variables:
Cost Drivers:
Identify key factors that drive repair costs up for specific types of repairs or properties (like "Terrace", "End Terrace" etc.).
Management Allocation:
Balanced distribution of the work portfolio (not skewed, as we observed in our data with "MA1" having an extremely large volume of jobs).
Contractor Performance and Optimal Allocation:
Similarly, equal workload distribution and performance monitoring and management to optimize repair costs (again not skewed, as we observed with contractor "N/A" having a disproportionate volume of requests).
Abandonment Reason and Risk Management:
Skewed allocation of management, contractors, job types and others can contribute to job abandonment and subsequently longer job completion times, thus driving up the repair costs.
Risk Management from a Property Type and Type of Repairs Perspective:
Identify properties or types of repairs that represent higher financial risks in terms of repair cost due to the intrinsic nature of these skilled jobs and types of properties.
##########################################################################################################################################
This is an extra modelling initiative without any concrete outcomes.
Time Series Analysis
Objectives:
The objective of this time series analysis is to forecast future repair costs for properties in the fictional county of Borsetshire.
By analyzing historical data on repair costs (based on job types and property types), the model aims to predict future expenses, which can be crucial for budgeting, resource allocation, and strategic planning.
How would this analysis potentially help in the future?
The inclusion of seasonality and exogenous variables like job types and property types adds depth to the analysis, allowing for a more nuanced understanding of what drives repair costs.
Financial Planning: Accurate forecasts enable better budget allocation and financial planning.
Resource Allocation: Anticipating high-cost periods helps in allocating resources efficiently.
Strategic Decision-Making: Understanding cost drivers aids in making informed strategic decisions.
Maintenance Scheduling: Identifying seasonal trends can guide proactive maintenance scheduling to prevent costly repairs.
Contractor Management: Insights into job types associated with higher costs can inform contractor negotiations and management.
Why this Analysis:
1- Seasonal Trends: If the model indicates specific seasons with higher repair costs, it suggests a need for proactive maintenance during these periods to mitigate potential issues.
2- Impact of Property and Job Types: The significance of certain property types or job types in driving costs can inform targeted maintenance strategies. For instance, if certain property types are more prone to expensive repairs, they could be prioritized for inspections or preventative maintenance.
3- Forecasting Uncertainty: The wide confidence intervals in the forecast suggest considerable uncertainty, indicating the need for a flexible and responsive approach to budgeting and resource allocation.
4- Continuous Monitoring and Model Refinement: As new data becomes available, continuously updating and refining the model will improve its accuracy and reliability. This also involves reassessing the impact of external factors regularly.
5- Data-Driven Decision-Making: The analysis emphasizes the importance of a data-driven approach in managing repairs and maintenance, ensuring decisions are based on empirical evidence rather than assumptions.
Overall, the time series analysis provides a foundation for making informed decisions about future repair needs and costs, helping to optimize operations and financial management in Borsetshire's property maintenance sector.
Data Pre-processing for Time Series Analysis
# Creating a copy of the dataframe
data_copy = Int_df_merged.copy()
# Data Preprocessing
# Handling missing values
numeric_cols = data_copy.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = data_copy.select_dtypes(include=['object']).columns.tolist()
for col in numeric_cols:
data_copy[col].fillna(data_copy[col].median(), inplace=True)
for col in categorical_cols:
data_copy[col].fillna(data_copy[col].mode()[0], inplace=True)
# Creating a time series dataset for repair cost prediction
data_copy['Date Logged'] = pd.to_datetime(data_copy['Date Logged'], infer_datetime_format=True, errors='coerce')
data_copy['Month'] = data_copy['Date Logged'].dt.month
data_copy['Year'] = data_copy['Date Logged'].dt.year
time_series_data = data_copy.groupby(['Year', 'Month'])['Total Value'].sum().reset_index()
time_series_data['Date'] = pd.to_datetime(time_series_data[['Year', 'Month']].assign(DAY=1))
time_series_data = time_series_data.set_index('Date')['Total Value']
time_series_data = time_series_data.asfreq('MS')
Data Stationarity Check
1- Augmented Dickey-Fuller (ADF) Test:
The Augmented Dickey-Fuller (ADF) test is a statistical test for stationarity.
The null hypothesis of the test is that the time series is non-stationary. If the p-value is less than a significance level (e.g., 0.05), you can reject the null hypothesis.
2- Checking the trend, seasonality and residual components as a non-stationarity check
The plots above illustrate the decomposition of the original time series data into its core components: trend, seasonality, and residuals. This decomposition is helpful in understanding different aspects of the time series:
Original Time Series (Top Plot): Shows the actual data as it was recorded. It's the starting point for analyzing the time series.
Trend Component (Second Plot): Represents the long-term progression of the series, showing how the data moves upwards or downwards over time. A consistent upward or downward trend indicates non-stationarity.
Seasonal Component (Third Plot): This captures the regular pattern of variability within the time series. For instance, specific peaks or troughs that repeat at the same time each year.
Residuals (Bottom Plot): These are what remains after the trend and seasonal components have been removed. Ideally, residuals should be random "noise"; if they still contain some structure, it indicates that the model has not fully captured all aspects of the time series data.
Decomposing a time series is particularly useful for understanding non-stationarity. If either the trend or seasonal components are very pronounced, they can be the reason why a time series is non-stationary. Addressing non-stationarity is often necessary before performing further time series forecasting.
result = adfuller(time_series_data)
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
# Decomposing the time series to analyze trend, seasonality, and residuals
decomposition = seasonal_decompose(time_series_data, model='additive', period=6)
# Extracting the trend, seasonal, and residual components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# Plotting the original data, trend, seasonality, and residuals
plt.figure(figsize=(14, 8))
plt.subplot(411)
plt.plot(time_series_data, label='Original')
plt.legend(loc='best')
plt.title('Original Time Series')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.title('Trend Component')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.title('Seasonal Component')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.title('Residuals')
plt.tight_layout()
plt.show()
ADF Statistic: -2.178108163946107
p-value: 0.21428664638431927
Critical Values: {'1%': -3.889265672705068, '5%': -3.0543579727254224, '10%': -2.66698384083045}
The time series is non-stationary.
Non-stationarity Removal Process
1- Differencing the Time-series (1st Order)
We can see that the data is still non-stationary after 1st order differencing.
time_series_diff = time_series_data.diff().dropna()
result = adfuller(time_series_diff)
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
# Decomposing the time series to analyze trend, seasonality, and residuals
decomposition = seasonal_decompose(time_series_diff, model='additive', period=6)
# Extracting the trend, seasonal, and residual components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# Plotting the original data, trend, seasonality, and residuals
plt.figure(figsize=(14, 8))
plt.subplot(411)
plt.plot(time_series_data, label='Original')
plt.legend(loc='best')
plt.title('Original Time Series')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.title('Trend Component')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.title('Seasonal Component')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.title('Residuals')
plt.tight_layout()
plt.show()
ADF Statistic: -2.0359197131142412
p-value: 0.2710489387045575
Critical Values: {'1%': -3.9644434814814815, '5%': -3.0849081481481484, '10%': -2.6818144444444445}
The time series is non-stationary.
# Differencing the 1st order time series again to make it stationary (2nd order differencing)
time_series_diff_2nd_order = time_series_data.diff().diff().dropna()
result = adfuller(time_series_diff_2nd_order)
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
# Decomposing the time series to analyze trend, seasonality, and residuals
decomposition = seasonal_decompose(time_series_diff_2nd_order, model='additive', period=6)
# Extracting the trend, seasonal, and residual components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# Plotting the original data, trend, seasonality, and residuals
plt.figure(figsize=(14, 8))
plt.subplot(411)
plt.plot(time_series_data, label='Original')
plt.legend(loc='best')
plt.title('Original Time Series')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.title('Trend Component')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.title('Seasonal Component')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.title('Residuals')
plt.tight_layout()
plt.show()
ADF Statistic: -5.7808769345747395
p-value: 5.129460391231545e-07
Critical Values: {'1%': -3.9240193847656246, '5%': -3.0684982031250003, '10%': -2.67389265625}
The time series is stationary.
# Define the p, d, and q parameters to take any value from 0 to 1
p = d = q = range(0, 2)
# Generate all different combinations of p, d, and q triplets
pdq = list(itertools.product(p, d, q))
best_aic = float("inf")
best_params = None
# Iterate over all the possible combinations of parameters
for param in pdq:
try:
model = ARIMA(time_series_data, order=param) # Note: using the original time series data, not differenced
results = model.fit()
if results.aic < best_aic:
best_aic = results.aic
best_params = param
except:
continue
print("Best ARIMA parameters:", best_params)
Best ARIMA parameters: (0, 1, 0)
C:\Users\dmish\anaconda3\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
C:\Users\dmish\anaconda3\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
Note that the parameter search above printed "Best ARIMA parameters: (0, 1, 0)", whereas the model fitted and interpreted below is ARIMA(1, 1, 0). For an ARIMA(1, 1, 0) specification, the parameters indicate the following:
1- p (AR order) = 1: This suggests that the best-fitting model includes one autoregressive term. In other words, the model predicts future values based on one lagged (previous) value of the series.
2- d (Differencing order) = 1: This indicates that the data requires first-order differencing to make it stationary. The model will automatically difference the data once before fitting the AR and MA components.
3- q (MA order) = 0: This means that the model does not use any moving average components. There are no lagged forecast errors in the prediction equation.
Interpretation:ΒΆ
The ARIMA(1,1,0) model - It suggests that the current value of the series is based on its immediately previous value and a trend component (due to the differencing). The lack of a moving average component implies that the model does not incorporate the error terms of the previous predictions into the current prediction.
# Split data into a training set and a test set
train_size = int(len(time_series_data) * 0.8) # 80% for training, 20% for testing
train, test = time_series_data[:train_size], time_series_data[train_size:]
# Fit the ARIMA(1, 1, 0) model to the training data
model = ARIMA(train, order=(1, 1, 0))
results = model.fit()
# Make predictions on the test set
predictions = results.forecast(steps=len(test))
# Calculate the Mean Squared Error (MSE) to evaluate model performance
mse = mean_squared_error(test, predictions)
print("Mean Squared Error (MSE):", mse)
# You can also visualize the actual vs. predicted values
import matplotlib.pyplot as plt
plt.plot(test.index, test.values, label="Actual")
plt.plot(test.index, predictions, label="Predicted")
plt.legend()
plt.xlabel("Date")
plt.xticks(rotation=45)
plt.ylabel("Total Value")
plt.title("ARIMA(1, 1, 0) Model: Actual vs. Predicted")
plt.show()
# To make future forecasts, you can use the 'forecast' method
forecast_steps = 12 # Adjust as needed
future_forecast = results.forecast(steps=forecast_steps)
# The 'future_forecast' contains the forecasted values for the next 'forecast_steps' periods
print("Future Forecast:", future_forecast)
Mean Squared Error (MSE): 9492491101.959797
Future Forecast:
2023-09-01    222795.387645
2023-10-01    223402.544991
2023-11-01    223292.322961
2023-12-01    223312.332429
2024-01-01    223308.699954
2024-02-01    223309.359386
2024-03-01    223309.239674
2024-04-01    223309.261406
2024-05-01    223309.257461
2024-06-01    223309.258177
2024-07-01    223309.258047
2024-08-01    223309.258071
Freq: MS, Name: predicted_mean, dtype: float64
# Calculate MAE
mae = mean_absolute_error(test, predictions)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test, predictions))
# Print MAE and RMSE
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)
# Visualize residuals
residuals = test - predictions
plt.figure(figsize=(10, 4))
plt.plot(test.index, residuals, label="Residuals")
plt.axhline(0, color="red", linestyle="--", label="Zero Residuals")
plt.xlabel("Date")
plt.ylabel("Residuals")
plt.title("Residual Analysis")
plt.legend()
plt.show()
Mean Absolute Error (MAE): 76244.94568870834
Root Mean Squared Error (RMSE): 97429.41599927506
# Fit the ARIMA(1,1,0) model
model = ARIMA(time_series_data, order=(1, 1, 0))
results = model.fit()
# You can then use results.summary() to get a summary of the model
print(results.summary())
# For forecasting future values
# forecast = results.get_forecast(steps=number_of_future_steps)
# predicted_values = forecast.predicted_mean
# confidence_intervals = forecast.conf_int()
SARIMAX Results
==============================================================================
Dep. Variable: Total Value No. Observations: 19
Model: ARIMA(1, 1, 0) Log Likelihood -220.228
Date: Tue, 19 Dec 2023 AIC 444.455
Time: 10:44:24 BIC 446.236
Sample: 06-01-2022 HQIC 444.701
- 12-01-2023
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.1570 0.190 0.827 0.408 -0.215 0.529
sigma2 2.586e+09 3.93e-11 6.59e+19 0.000 2.59e+09 2.59e+09
===================================================================================
Ljung-Box (L1) (Q): 0.06 Jarque-Bera (JB): 3.75
Prob(Q): 0.80 Prob(JB): 0.15
Heteroskedasticity (H): 5.38 Skew: -1.05
Prob(H) (two-sided): 0.06 Kurtosis: 3.79
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.56e+35. Standard errors may be unstable.
# Define the p, d, and q parameters to take any value from 0 to 1
p = d = q = range(0, 2)
# Generate all different combinations of p, d, and q triplets
pdq = list(itertools.product(p, d, q))
best_aic = float("inf")
best_params = None
# Iterate over all the possible combinations of parameters
for param in pdq:
try:
model = SARIMAX(time_series_diff_2nd_order, order=param) # Using the differenced time series
results = model.fit()
if results.aic < best_aic:
best_aic = results.aic
best_params = param
except:
continue
print("Best SARIMAX parameters:", best_params)
C:\Users\dmish\anaconda3\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
Best SARIMAX parameters: (1, 1, 0)
# Splitting the data into train and test sets
# Note: the split point is computed on the 2nd-order differenced series but applied to the
# original series, so the test window is two periods longer than a split on the same series
train_size = int(len(time_series_diff_2nd_order) * 0.8)
train, test = time_series_data[:train_size], time_series_data[train_size:]
# Define the SARIMAX model (note: (1, 1, 1) here differs from the (1, 1, 0) found by the search above)
best_order = (1, 1, 1)
model = SARIMAX(train, order=best_order)
results = model.fit()
# Forecasting using the test set
forecast = results.get_forecast(steps=len(test))
forecast_mean = forecast.predicted_mean
# Calculating diagnostic metrics
mse = mean_squared_error(test, forecast_mean)
mae = mean_absolute_error(test, forecast_mean)
rmse = np.sqrt(mse)
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)
# Printing the forecasted repair costs for different dates
forecast_dates = forecast_mean.index
forecasted_costs = forecast_mean.values
for date, cost in zip(forecast_dates, forecasted_costs):
print(f"Date: {date}, Forecasted Cost: {cost:.2f}")
# Plotting the actual and predicted values on the same time scale
plt.figure(figsize=(12, 6))
plt.plot(time_series_data.index, time_series_data, label="Actual")
plt.plot(test.index, forecast_mean, label="Predicted", color='red')
plt.legend()
plt.xlabel("Date")
plt.ylabel("Total Value")
plt.title("Actual vs. Predicted Values")
plt.show()
Mean Squared Error (MSE): 6643785782.092053
Mean Absolute Error (MAE): 54992.487680116064
Root Mean Squared Error (RMSE): 81509.4214314643
Date: 2023-07-01 00:00:00, Forecasted Cost: 231222.81
Date: 2023-08-01 00:00:00, Forecasted Cost: 228185.02
Date: 2023-09-01 00:00:00, Forecasted Cost: 230061.20
Date: 2023-10-01 00:00:00, Forecasted Cost: 228902.45
Date: 2023-11-01 00:00:00, Forecasted Cost: 229618.11
Date: 2023-12-01 00:00:00, Forecasted Cost: 229176.11
# Inference - The ARIMA prediction line is flat, predicting an almost constant forecasted total repair cost on the test data (between 2023-07-01 and 2023-12-01).
# This result is completely off-track, and the model does not perform well, as also evidenced by the diagnostic metrics (MSE, MAE, RMSE).
Time Series Analysis
Check for Data Stationarity
repair_ts = Int_df_merged['Date Logged'].value_counts().sort_index().reset_index()
# repair_ts.columns = ['Date Comp', 'Count']
repair_ts.columns = ['Date Logged', 'Count']
repair_ts.set_index('Date Logged', inplace=True)
repair_ts.index = pd.to_datetime(repair_ts.index)
repair_ts.info()
# repair_ts = Int_df_merged.resample('M').size()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 548 entries, 2022-06-09 to 2023-12-08
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Count   548 non-null    int64
dtypes: int64(1)
memory usage: 8.6 KB
Imputation:
1- Replacing with mean or median: If you have a time series, one common approach is to impute missing values with the mean or median of the existing values. You can use the fillna method for this.
2- Dropping missing values: If the missing values are limited and won't significantly impact the analysis, you may choose to drop rows with missing values.
This approach is suitable when the number of missing values is relatively small.
3- Interpolation: Another approach is to interpolate missing values based on the surrounding data points. This is useful when you want to maintain the overall trend of the time series.
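The three options above can each be written as a one-liner on the counts series; a minimal sketch on a small synthetic series (in the notebook the series would be `repair_ts['Count']`, and the dates/values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily series with gaps, standing in for repair_ts['Count']
idx = pd.date_range("2022-06-09", periods=10, freq="D")
counts = pd.Series([34.0, 30.0, np.nan, 6.0, 49.0, np.nan, np.nan, 52.0, 78.0, 62.0], index=idx)

# 1- Replace missing values with the median of the observed values
filled_median = counts.fillna(counts.median())

# 2- Drop rows with missing values
dropped = counts.dropna()

# 3- Linear interpolation between neighbouring observations
interpolated = counts.interpolate(method="linear")

print(filled_median.isna().sum(), len(dropped), interpolated.isna().sum())  # → 0 7 0
```

Interpolation is usually the most natural choice for a daily counts series, since it preserves the local trend rather than pulling gap values toward the global centre.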
print(repair_ts.isnull().sum())
print(np.isfinite(repair_ts).all())
Count    0
dtype: int64
Count    True
dtype: bool
Trend in the Original Data
plt.figure(figsize=(12, 6))
plt.plot(repair_ts.index, repair_ts['Count'], label='Number of Repairs')
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
# Set x-axis ticks at intervals of every 3 months
locator = MonthLocator(interval=3)
plt.gca().xaxis.set_major_locator(locator)
plt.xticks(rotation=45) # Rotates the date labels for better readability
plt.show()
Observations:
- The original data shows random fluctuations in the number of repairs based on the repair logged date.
Rolling Statistics:
Plot the rolling mean and rolling standard deviation to observe any trends or changes in variability.
rolling_mean = repair_ts['Count'].rolling(window=12).mean()
rolling_std = repair_ts['Count'].rolling(window=12).std()
plt.figure(figsize=(12, 6))
plt.plot(repair_ts['Count'], label='Original')
plt.plot(rolling_mean, label='Rolling Mean')
plt.plot(rolling_std, label='Rolling Std')
plt.title('Rolling Mean and Standard Deviation')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
plt.show()
Augmented Dickey-Fuller Test:
Perform the Augmented Dickey-Fuller (ADF) test, a statistical test for stationarity. The null hypothesis of the test is that the time series is non-stationary. If the p-value is less than a significance level (e.g., 0.05), you can reject the null hypothesis.
result = adfuller(repair_ts['Count'].values)
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
ADF Statistic: -3.3903292964777347
p-value: 0.011292996298793828
Critical Values: {'1%': -3.442678467240966, '5%': -2.8669778698997543, '10%': -2.5696661916864083}
The time series is stationary.
print(repair_ts.isnull().sum())
print(np.isfinite(repair_ts).all())
Count    0
dtype: int64
Count    True
dtype: bool
Decomposing the original time series into trend, seasonal and residual components to observe any pattern
# Decompose the time series into trend, seasonal, and residual components
decomposition = sm.tsa.seasonal_decompose(repair_ts['Count'].values, model='additive', period=12) # Adjust the period as needed
# Plot the original time series
plt.figure(figsize=(12, 8))
plt.subplot(4, 1, 1)
plt.plot(repair_ts.index, repair_ts['Count'], label='Original Time Series')
plt.title('Original Time Series')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
# Plot the trend component
plt.subplot(4, 1, 2)
plt.plot(repair_ts.index, decomposition.trend, label='Trend Component', color='orange')
plt.title('Trend Component')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
# Plot the seasonal component
plt.subplot(4, 1, 3)
plt.plot(repair_ts.index, decomposition.seasonal, label='Seasonal Component', color='green')
plt.title('Seasonal Component')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
# Plot the residual component
plt.subplot(4, 1, 4)
plt.plot(repair_ts.index, decomposition.resid, label='Residual Component', color='red')
plt.title('Residual Component')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
plt.tight_layout()
plt.show()
print(repair_ts.isnull().sum())
print(np.isfinite(repair_ts).all())
repair_ts
Count    0
dtype: int64
Count    True
dtype: bool
| Date Logged | Count |
|---|---|
| 2022-06-09 | 34 |
| 2022-06-10 | 30 |
| 2022-06-11 | 8 |
| 2022-06-12 | 6 |
| 2022-06-13 | 49 |
| ... | ... |
| 2023-12-04 | 65 |
| 2023-12-05 | 52 |
| 2023-12-06 | 78 |
| 2023-12-07 | 62 |
| 2023-12-08 | 6 |

548 rows × 1 columns
# Check stationarity
result = adfuller(repair_ts['Count'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] > 0.05:
print("The time series is non-stationary. Applying differencing...")
repair_ts_diff = repair_ts.diff().dropna()
else:
print("The time series is stationary.")
repair_ts_diff = repair_ts
# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(repair_ts_diff.index, repair_ts_diff['Count'], label='Number of Repairs')
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
plt.show()
ADF Statistic: -3.3903292964777347
p-value: 0.011292996298793828
Critical Values: {'1%': -3.442678467240966, '5%': -2.8669778698997543, '10%': -2.5696661916864083}
The time series is stationary.
print(repair_ts.isnull().sum())
print(np.isfinite(repair_ts).all())
repair_ts.info()
Count    0
dtype: int64
Count    True
dtype: bool
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 548 entries, 2022-06-09 to 2023-12-08
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Count   548 non-null    int64
dtypes: int64(1)
memory usage: 8.6 KB
print(repair_ts.index)
print(type(repair_ts.index))
DatetimeIndex(['2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
'2022-06-13', '2022-06-14', '2022-06-15', '2022-06-16',
'2022-06-17', '2022-06-18',
...
'2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',
'2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',
'2023-12-07', '2023-12-08'],
dtype='datetime64[ns]', name='Date Logged', length=548, freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
print(train.index)
print(train.index.isnull().sum())
print(test.index)
print(test.index.isnull().sum())
DatetimeIndex(['2022-06-01', '2022-07-01', '2022-08-01', '2022-09-01',
'2022-10-01', '2022-11-01', '2022-12-01', '2023-01-01',
'2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01',
'2023-06-01'],
dtype='datetime64[ns]', name='Date', freq='MS')
0
DatetimeIndex(['2023-07-01', '2023-08-01', '2023-09-01', '2023-10-01',
'2023-11-01', '2023-12-01'],
dtype='datetime64[ns]', name='Date', freq='MS')
0
SARIMA (Seasonal Autoregressive Integrated Moving Average) Model Forecasting
Note - Please note, this is just a modelling attempt.
- The SARIMA model should not really be used with our "Repairs" data:
our data does not show clear seasonal patterns or trends, and is limited to roughly 18 months. SARIMA is appropriate only for time series with seasonal structure.
repair_ts.index = pd.to_datetime(repair_ts.index)
# Explicitly set the frequency of the time series index
repair_ts = repair_ts.asfreq(freq='D')
# Check stationarity
result = adfuller(repair_ts['Count'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] > 0.05:
print("The time series is non-stationary. Applying differencing...")
repair_ts_diff = repair_ts.diff().dropna()
else:
print("The time series is stationary.")
repair_ts_diff = repair_ts
# Split the data into training and testing sets
train_size = int(len(repair_ts_diff) * 0.8)
train, test = repair_ts_diff[:train_size], repair_ts_diff[train_size:]
# Fit SARIMA model
order = (1, 1, 1)
# Seasonal period m=12: on this daily series that means a 12-day cycle, not monthly
# seasonality; with daily data a weekly period (m=7) would be the more natural choice
seasonal_order = (1, 1, 1, 12)
model = SARIMAX(train, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
fit_model = model.fit(disp=False)
# Forecast
forecast = fit_model.get_forecast(steps=len(test))
predicted_values = forecast.predicted_mean
# Plotting
fig, ax = plt.subplots(figsize=(12, 8))
# Plotting Training Data
ax.plot(train.index, train, label='Training Data', color='blue')
# Plotting Test Data
ax.plot(test.index, test, label='Test Data', color='green')
# Plotting Forecasted Data
ax.plot(test.index, predicted_values, color='red', label='Forecasted Data')
# Add labels and legend
plt.title('SARIMA Forecasting')
plt.xlabel('Date')
plt.ylabel('Differenced Number of Repairs')
plt.legend()
plt.tight_layout()
plt.show()
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test, predicted_values))
print('Root Mean Squared Error (RMSE):', rmse)
ADF Statistic: -3.3903292964777347
p-value: 0.011292996298793828
Critical Values: {'1%': -3.442678467240966, '5%': -2.8669778698997543, '10%': -2.5696661916864083}
The time series is stationary.
Root Mean Squared Error (RMSE): 22.502452812001987
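An RMSE of roughly 22.5 is hard to judge in isolation; a naive last-value baseline gives it context, since a SARIMA fit that cannot beat the naive forecast adds no value. A hedged sketch on illustrative numbers (not this notebook's data):

```python
import numpy as np

# Illustrative test segment and its naive forecast (repeat the last training observation)
test_values = np.array([30.0, 25.0, 28.0, 35.0, 27.0, 31.0])
last_train_value = 29.0
naive_forecast = np.full_like(test_values, last_train_value)

# RMSE of the naive baseline; compare a model's RMSE against this before trusting it
naive_rmse = np.sqrt(np.mean((test_values - naive_forecast) ** 2))
print(f"Naive baseline RMSE: {naive_rmse:.3f}")
```

The same two lines applied to `train.iloc[-1]` and `test` would give the baseline for the model above.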
#Note: the ADF test above indicates the series is stationary (p ≈ 0.011), even though the earlier rolling-statistics plots hinted otherwise.
#If the rolling mean of a time series trends upward while its rolling standard deviation trends downward, the series may not be stationary. Stationarity is a key assumption for many time series models, including SARIMA.
#Here are a few steps you can take to address the non-stationarity:
#Differencing:
#Apply differencing to the time series to remove the trend.
#You can start with first-order differencing (d=1) and seasonal differencing (D=1) if data exhibits seasonality.
#Check Stationarity After Differencing:
#Plot the differenced time series and check the rolling mean and standard deviation again.
#You can also rerun the Augmented Dickey-Fuller test to confirm stationarity.
#First Order Seasonal Differencing:
#This line performs first-order seasonal differencing by taking the difference between the original time series (repair_ts)
#and the values from the same month in the previous year.
diff_ts_seasonal = repair_ts.diff(periods=12).dropna()
# Perform ADF test on the seasonally differenced series
result_diff_seasonal = adfuller(diff_ts_seasonal)
print('ADF Statistic after seasonal differencing:', result_diff_seasonal[0])
print('p-value after seasonal differencing:', result_diff_seasonal[1])
print('Critical Values:', result_diff_seasonal[4])
if result_diff_seasonal[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
# Check for missing values in the seasonally differenced series
if diff_ts_seasonal.isnull().sum().sum() == 0:  # no NaNs remain after differencing
# Calculate rolling mean and standard deviation for seasonally differenced series
rolling_mean_seasonal = diff_ts_seasonal.rolling(window=3).mean() # Adjust window size
rolling_std_seasonal = diff_ts_seasonal.rolling(window=3).std() # Adjust window size
# Plot the seasonally differenced time series, rolling mean, and rolling standard deviation
plt.figure(figsize=(12, 6))
plt.plot(diff_ts_seasonal, label='Seasonally Differenced Time Series')
plt.plot(rolling_mean_seasonal, label='Rolling Mean (Seasonal)', color='red')
plt.plot(rolling_std_seasonal, label='Rolling Std (Seasonal)', color='green')
plt.title('Seasonally Differenced Time Series with Rolling Mean and Standard Deviation')
plt.xlabel('Date')
plt.ylabel('Seasonally Differenced Number of Repairs')
plt.legend()
plt.show()
else:
print("There are missing values in the seasonally differenced series.")
ADF Statistic after seasonal differencing: -6.3539600667369145
p-value after seasonal differencing: 2.569970471178525e-08
Critical Values: {'1%': -3.443061925077973, '5%': -2.8671466525252014, '10%': -2.5697561378507907}
The time series is stationary.
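The mechanics of seasonal differencing are easiest to see on a toy series: subtracting the value from the same period one cycle earlier removes a stable seasonal pattern exactly. A minimal sketch with pandas (synthetic monthly data, purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series that repeats a fixed 12-month pattern (illustrative only)
pattern = np.array([5, 3, 8, 2, 7, 4, 6, 1, 9, 2, 5, 3], dtype=float)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
ts = pd.Series(np.tile(pattern, 4), index=idx)

# diff(periods=12) subtracts the value from the same month one year earlier;
# a perfectly repeating pattern is removed exactly, leaving zeros
seasonal_diff = ts.diff(periods=12).dropna()
print(seasonal_diff.abs().max())  # 0.0 for a perfectly periodic series
```

On real data the pattern is never perfect, so the differenced series keeps the irregular component, which is what the ADF test above then evaluates.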
#Higher-Order Differencing: Try applying higher-order differencing to the seasonally differenced series.
# 'diff_ts_seasonal' is the seasonally differenced time series
diff_ts_seasonal_higher = diff_ts_seasonal.diff(periods=1).dropna()
# Perform ADF test on the higher-order differenced series
result_diff_higher = adfuller(diff_ts_seasonal_higher)
print('ADF Statistic after higher-order differencing:', result_diff_higher[0])
print('p-value after higher-order differencing:', result_diff_higher[1])
print('Critical Values:', result_diff_higher[4])
ADF Statistic after higher-order differencing: -10.872259061592617
p-value after higher-order differencing: 1.3590894437654341e-19
Critical Values: {'1%': -3.4431115411022146, '5%': -2.8671684899522023, '10%': -2.5697677754736543}
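Each extra round of differencing risks over-differencing, which inflates variance rather than helping. A common heuristic picks the differencing order d that minimizes the variance of the d-times differenced series; a hedged sketch on a synthetic random walk (illustrative only, not the repair data):

```python
import numpy as np

# Synthetic random walk: a single regular difference is enough to make it stationary
rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=2000))

# Variance heuristic: the appropriate d minimizes the variance of the differenced series;
# differencing past that point (over-differencing) increases the variance again
variances = [np.var(np.diff(walk, n=d)) for d in range(3)]
best_d = int(np.argmin(variances))
print(f"variances by d: {[round(v, 2) for v in variances]}, chosen d = {best_d}")
```

Applying the same heuristic to `repair_ts['Count']` would indicate whether the second round of differencing above is actually needed.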
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
# First Order Seasonal Differencing:
diff_ts_seasonal = repair_ts.diff(periods=12).dropna()
# First Order Differencing on the Seasonally Differenced Series:
#This line performs first-order differencing on the result of the first line (diff_ts_seasonal).
#It takes the difference between consecutive observations in the seasonally differenced time series.
#So the combined effect of the two operations is one seasonal difference (D=1) plus one regular
#difference (d=1): two rounds of differencing in total, sometimes loosely called "2nd order".
diff_order = 1 # Adjusting the order of differencing as needed
diff_ts_seasonal_higher = diff_ts_seasonal.diff(periods=diff_order).dropna() # Applying first-order differencing again
# Performing ADF test on the higher-order differenced series
result_diff_higher = adfuller(diff_ts_seasonal_higher)
print('ADF Statistic after higher-order differencing:', result_diff_higher[0])
print('p-value after higher-order differencing:', result_diff_higher[1])
print('Critical Values:', result_diff_higher[4])
if result_diff_higher[1] <= 0.05:
print("The time series is stationary.")
else:
print("The time series is non-stationary.")
# Calculate rolling mean and standard deviation for higher-order differenced series
rolling_mean_higher = diff_ts_seasonal_higher.rolling(window=6, center=False).mean()
rolling_std_higher = diff_ts_seasonal_higher.rolling(window=6, center=False).std()
# Plot the higher-order differenced time series, rolling mean, and rolling standard deviation
plt.figure(figsize=(12, 6))
plt.plot(diff_ts_seasonal_higher, label='Higher-Order Differenced Time Series')
plt.plot(rolling_mean_higher, label='Rolling Mean (Higher Order)', color='red')
plt.plot(rolling_std_higher, label='Rolling Std (Higher Order)', color='green')
plt.title('Higher-Order Differenced Time Series with Rolling Mean and Standard Deviation')
plt.xlabel('Date')
plt.ylabel('Higher-Order Differenced Number of Repairs')
plt.legend()
plt.show()
ADF Statistic after higher-order differencing: -10.872259061592617
p-value after higher-order differencing: 1.3590894437654341e-19
Critical Values: {'1%': -3.4431115411022146, '5%': -2.8671684899522023, '10%': -2.5697677754736543}
The time series is stationary.
# Analyze particular months or seasons with higher volume of repair jobs
# Convert Date Logged and Date Comp columns to datetime type
Int_df_merged['Date Logged'] = pd.to_datetime(Int_df_merged['Date Logged'])
Int_df_merged['Date Comp'] = pd.to_datetime(Int_df_merged['Date Comp'])
# Create a new DataFrame with Date Logged as the index
repair_df = Int_df_merged.set_index('Date Logged')
monthly_counts = repair_df.resample('M').size()
monthly_max = monthly_counts.idxmax()
print(f"The month with the highest number of repairs is {monthly_max.strftime('%B %Y')} with {monthly_counts.max()} repairs.")
# Plot the number of repairs over time
plt.figure(figsize=(12, 6))
repair_df.resample('M').size().plot(legend=False)
plt.title('Number of Repairs Over Time')
plt.xlabel('Date Logged')
plt.ylabel('Number of Repairs')
plt.show()
# Identify trends, seasonality, and patterns
# You can use more advanced methods such as decomposition if needed
# Example:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(repair_df.resample('D').size())
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# Plot the decomposed components
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(repair_df.resample('M').size(), label='Original')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonal')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residual')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
The month with the highest number of repairs is March 2023 with 1478 repairs.
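With only 18 months of data, a single peak month cannot distinguish recurring seasonality from a one-off spike. Averaging counts by calendar month makes the distinction visible when a month is observed in more than one year; a minimal sketch with illustrative counts (not the notebook's actual figures):

```python
import pandas as pd

# Illustrative monthly repair counts over 18 months (hypothetical values)
idx = pd.date_range("2022-06-01", periods=18, freq="MS")
counts = pd.Series([900, 950, 940, 980, 1000, 990, 1010, 1100, 1050, 1478,
                    1020, 1000, 980, 990, 1010, 1030, 1040, 1000], index=idx)

# Average by calendar month: a genuinely seasonal month stays high across years,
# while a one-off spike (like a single busy March) is diluted by the other years
by_month = counts.groupby(counts.index.month).mean()
print(f"Highest average calendar month: {int(by_month.idxmax())}")
```

The same `groupby(index.month)` applied to `monthly_counts` above would show whether March is a recurring peak or a one-time event.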
# Perform ADF test on the differenced series to check stationarity
#ADF Statistic: The more negative it is, the stronger the evidence against the null hypothesis of non-stationarity.
#p-value: A very low p-value (typically less than 0.05) indicates that you can reject the null hypothesis.
#In this case, it suggests that the time series is likely stationary
result_diff_higher = adfuller(diff_ts_seasonal_higher)
print('ADF Statistic after higher-order differencing:', result_diff_higher[0])
print('p-value after higher-order differencing:', result_diff_higher[1])
# If the series is stationary, proceed with manual decomposition
if result_diff_higher[1] <= 0.05:
# Calculate the trend as the rolling mean
trend = rolling_mean_higher
# Approximate the seasonal component as the detrended series (differenced series minus rolling-mean trend)
seasonal = diff_ts_seasonal_higher - trend
# Note: with this rough split the "residual" reduces algebraically to the trend itself;
# the proper seasonal_decompose further below gives a cleaner separation
residual = diff_ts_seasonal_higher - seasonal
# Plot the components
plt.figure(figsize=(12, 12))
plt.subplot(411)
plt.plot(repair_ts, label='Original Time Series')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(diff_ts_seasonal_higher, label='Differenced Time Series')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(trend, label='Trend Component', color='red')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(seasonal, label='Seasonal Component', color='green')
plt.plot(residual, label='Residual Component', color='blue')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
else:
print("The time series is still non-stationary after higher-order differencing.")
ADF Statistic after higher-order differencing: -10.872259061592617
p-value after higher-order differencing: 1.3590894437654341e-19
#SARIMA: differenced time series after higher-order differencing
# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(diff_ts_seasonal_higher, label='Differenced Time Series')
plt.title('Differenced Time Series After Higher-Order Differencing')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
plt.show()
# Decompose the time series into trend, seasonal, and residual components
decomposition = sm.tsa.seasonal_decompose(diff_ts_seasonal_higher)
# Plot the components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.figure(figsize=(12, 10))
plt.subplot(411)
plt.plot(diff_ts_seasonal_higher, label='Original')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonal')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residual')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
# Fit SARIMA model
# Example values are provided, but you may need to adjust them
p, d, q = 1, 1, 1 # ARIMA order
P, D, Q, m = 1, 1, 1, 12 # Seasonal order
model = sm.tsa.SARIMAX(diff_ts_seasonal_higher, order=(p, d, q), seasonal_order=(P, D, Q, m))
results = model.fit()
# Print model summary
print(results.summary())
# Plot diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
# Forecast future values
forecast_steps = 12 # Adjust as needed
forecast = results.get_forecast(steps=forecast_steps)
forecast_ci = forecast.conf_int()
# Plot the forecast
plt.figure(figsize=(12, 6))
plt.plot(diff_ts_seasonal_higher, label='Observed')
plt.plot(forecast.predicted_mean, label='Forecast', color='red')
plt.fill_between(forecast_ci.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='red', alpha=0.2)
plt.title('SARIMA Forecast')
plt.xlabel('Date')
plt.ylabel('Number of Repairs')
plt.legend()
plt.show()
SARIMAX Results
==========================================================================================
Dep. Variable: Count No. Observations: 535
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -2575.724
Date: Tue, 19 Dec 2023 AIC 5161.448
Time: 10:44:37 BIC 5182.737
Sample: 06-22-2022 HQIC 5169.786
- 12-08-2023
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.0219 0.068 -0.325 0.746 -0.154 0.111
ma.L1 -0.9995 1.643 -0.608 0.543 -4.220 2.221
ar.S.L12 -0.5739 0.040 -14.220 0.000 -0.653 -0.495
ma.S.L12 -0.9991 3.286 -0.304 0.761 -7.439 5.441
sigma2 984.3485 3446.658 0.286 0.775 -5770.977 7739.674
===================================================================================
Ljung-Box (L1) (Q): 0.02 Jarque-Bera (JB): 2.36
Prob(Q): 0.89 Prob(JB): 0.31
Heteroskedasticity (H): 1.46 Skew: -0.07
Prob(H) (two-sided): 0.01 Kurtosis: 3.30
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
################################################################################################################
Please disregard the two graphs below; they are rough plots from the exploratory analysis done earlier.¶
# Create a DataFrame with selected numeric columns
numeric_columns = [
'Initial Priority', 'LATEST_PRIORITY', 'Job Status', 'Total Value', 'Year of Build Date'
]
numeric_df = Int_df_merged[numeric_columns]
# Create a dictionary to map numeric columns to their corresponding description columns
column_descriptions = {
'Initial Priority': 'Initial Priority Description',
'LATEST_PRIORITY': 'Latest Priority Description',
'Job Status': 'JOB_STATUS_DESCRIPTION',
'Total Value': None,
'Year of Build Date': None
}
# Plot histograms with labeled x-axis
fig, axes = plt.subplots(nrows=len(numeric_columns), ncols=1, figsize=(12, 5 * len(numeric_columns)))
for i, column in enumerate(numeric_columns):
description_column = column_descriptions.get(column, column)
if description_column is not None:
labels = Int_df_merged[description_column].astype(str).unique()
labels.sort()
bins = range(len(labels) + 1)
axes[i].hist(numeric_df[column], bins=bins, align='left', rwidth=0.8)
axes[i].set_title(f'Histogram: {description_column}')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
axes[i].set_xticks(range(len(labels)))
axes[i].set_xticklabels(labels, rotation=45, ha='right')
else:
axes[i].hist(numeric_df[column], bins=20)
axes[i].set_title(f'Histogram: {column}')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# Select numeric columns
numeric_columns = [
'Initial Priority', 'LATEST_PRIORITY', 'Job Status', 'Total Value', 'Year of Build Date'
]
# Create a dictionary to map numeric codes to descriptions
column_descriptions = {
'Initial Priority': 'Initial Priority Description',
'LATEST_PRIORITY': 'Latest Priority Description',
'Job Status': 'JOB_STATUS_DESCRIPTION',
'Total Value': None,
'Year of Build Date': None
}
# Plot histograms with labeled x-axis and bell-shaped curve using Seaborn
fig, axes = plt.subplots(nrows=len(numeric_columns), ncols=1, figsize=(12, 5 * len(numeric_columns)))
for i, column in enumerate(numeric_columns):
description_column = column_descriptions.get(column, column)
if description_column is not None and description_column in Int_df_merged.columns:
# Ensure the data type is 'category'
Int_df_merged[description_column] = Int_df_merged[description_column].astype('category')
# Use Seaborn's histplot with multiple="layer" or omit multiple for non-stacked histograms
sns.histplot(data=Int_df_merged, x=description_column, bins=20, kde=True, ax=axes[i])
axes[i].set_title(f'Histogram: {description_column}')
axes[i].set_xlabel(description_column)
axes[i].set_ylabel('Frequency')
axes[i].tick_params(axis='x', rotation=90) # Rotate x-axis labels
else:
sns.histplot(Int_df_merged, x=column, bins=20, kde=True, ax=axes[i])
axes[i].set_title(f'Histogram: {column}')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()